﻿WEBVTT

00:00:08.355 --> 00:00:11.357
- Okay, it's after 12, so I think
we should get started.

00:00:14.644 --> 00:00:17.419
Today we're going to kind of pick up
where we left off last time.

00:00:17.419 --> 00:00:23.400
Last time we talked about a lot of sort of tips and tricks
involved in the nitty gritty details of training neural networks.

00:00:23.400 --> 00:00:30.439
Today we'll pick up where we left off, and talk about a lot more
of these sort of nitty gritty details about training these things.

00:00:30.439 --> 00:00:34.707
As usual, a couple administrative notes
before we get into the material.

00:00:34.707 --> 00:00:39.645
As you all know, assignment one is already
due. Hopefully you all turned it in.

00:00:39.645 --> 00:00:57.322
Did it go okay? Was it not okay? Rough sentiment? Mostly okay. Okay, that's good. Awesome. [laughs] We're in
the process of grading those, so stay tuned. We're hoping to get grades back for those before A2 is due.

00:00:57.322 --> 00:01:04.121
Another reminder, that your project proposals
are due tomorrow. Actually, no, today at 11:59.

00:01:04.959 --> 00:01:09.074
Make sure you send those in.
Details are on the website and on Piazza.

00:01:09.074 --> 00:01:15.269
Also a reminder, assignment two is already
out. That'll be due a week from Thursday.

00:01:15.269 --> 00:01:25.860
Historically, assignment two has been the longest one in the class, so if you haven't
started already on assignment two, I'd recommend you take a look at that pretty soon.

00:01:27.122 --> 00:01:32.484
Another reminder is that for assignment two, I
think a lot of you will be using Google Cloud.

00:01:32.484 --> 00:01:38.586
Big reminder, make sure to stop your instances when you're not
using them because whenever your instance is on, you get charged,

00:01:38.586 --> 00:01:42.899
and we only have so many coupons
to distribute to you guys.

00:01:42.899 --> 00:01:52.223
Anytime your instance is on, even if you're not SSH'd into it, even if you're not running things
immediately in your Jupyter Notebook, any time that instance is on, you're going to be charged.

00:01:52.223 --> 00:01:57.118
Just make sure that you explicitly stop
your instances when you're not using them.

00:01:57.118 --> 00:02:04.970
In this example, I've got a little screenshot of my dashboard on Google Cloud.
I need to go in there and explicitly go to the dropdown and click stop.

00:02:04.970 --> 00:02:08.644
Just make sure that you do this when
you're done working each day.

00:02:09.481 --> 00:02:20.853
Another thing to remember is it's kind of up to you guys to keep track of your spending on Google
Cloud. In particular, instances that use GPUs are a lot more expensive than those with CPUs.

00:02:20.853 --> 00:02:28.322
Rough order of magnitude, those GPU instances are around 90
cents to a dollar an hour. Those are actually quite pricey.

00:02:28.322 --> 00:02:39.739
The CPU instances are much cheaper. The general strategy is that you probably want to make two instances,
one with a GPU and one without, and then only use that GPU instance when you really need the GPU.

00:02:39.739 --> 00:02:47.377
For example, on assignment two, most of the assignment, you should
only need the CPU, so you should only use your CPU instance for that.

00:02:47.377 --> 00:02:52.990
But then the final question, about
TensorFlow or PyTorch, will need a GPU.

00:02:52.990 --> 00:02:58.897
This'll give you a little bit of practice with switching between
multiple instances and only using that GPU when it's really necessary.

00:02:58.897 --> 00:03:04.307
Again, just kind of watch your spending.
Try not to go too crazy on these things.

00:03:04.307 --> 00:03:07.748
Any questions on the administrative stuff
before we move on?

00:03:11.180 --> 00:03:12.182
Question.

00:03:12.182 --> 00:03:13.902
- [Student] How much RAM should we use?

00:03:13.902 --> 00:03:16.133
- Question is how much RAM should we use?

00:03:16.133 --> 00:03:21.863
I think eight or 16 gigs is probably good
for everything that you need in this class.

00:03:21.863 --> 00:03:27.114
As you scale up the number of CPUs and the amount
of RAM, you also end up spending more money.

00:03:27.114 --> 00:03:34.542
If you stick with two or four CPUs and eight or 16 gigs of RAM, that
should be plenty for all the homework-related stuff that you need to do.

00:03:36.636 --> 00:03:40.417
As a quick recap, last time we
talked about activation functions.

00:03:40.417 --> 00:03:44.962
We talked about this whole zoo of different activation
functions and some of their different properties.

00:03:44.962 --> 00:03:59.736
We saw that the sigmoid, which used to be quite popular when training neural networks maybe 10 years ago or so, has this
problem with vanishing gradients near the two ends of the activation function. tanh has this similar sort of problem.

00:03:59.736 --> 00:04:09.230
Kind of the general recommendation is that you probably want to stick with ReLU for most cases
as sort of a default choice 'cause it tends to work well for a lot of different architectures.

00:04:09.230 --> 00:04:16.820
We also talked about weight initialization. Remember that up on
the top, we have this idea that when you initialize your weights

00:04:16.820 --> 00:04:23.787
at the start of training, if those weights are initialized to be
too small, then the activations will vanish

00:04:23.788 --> 00:04:29.583
as you go through the network because as you multiply by these small
numbers over and over again, they'll all sort of decay to zero.

00:04:29.583 --> 00:04:33.072
Then everything will be zero,
learning won't happen, you'll be sad.

00:04:33.072 --> 00:04:41.208
On the other hand, if you initialize your weights too big, then as you go through the
network and multiply by your weight matrix over and over again, eventually they'll explode.

00:04:41.208 --> 00:04:45.389
You'll be unhappy, there'll be no
learning, it will be very bad.

00:04:45.389 --> 00:04:58.531
But if you get that initialization just right, for example, using the Xavier initialization or the MSRA
initialization, then you kind of keep a nice distribution of activations as you go through the network.

00:04:58.531 --> 00:05:04.328
Remember that this kind of gets more and more important and
more and more critical as your networks get deeper and deeper

00:05:04.328 --> 00:05:11.620
because as your network gets deeper, you're multiplying by those weight
matrices over and over again with these more multiplicative terms.

00:05:11.620 --> 00:05:23.666
We also talked last time about data preprocessing. We talked about how it's pretty typical
in conv nets to zero center and normalize your data so it has zero mean and unit variance.
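
As a concrete sketch of that zero-center-and-normalize step (using NumPy and made-up data, not the course's actual preprocessing code):

```python
import numpy as np

# Hypothetical data matrix: N examples, D features, deliberately
# off-center and with large variance.
X = np.random.randn(100, 3) * 10.0 + 5.0

# Zero-center: subtract the per-feature mean.
X_centered = X - X.mean(axis=0)

# Normalize: divide by the per-feature standard deviation
# so every feature has roughly unit variance.
X_normalized = X_centered / X_centered.std(axis=0)
```

On image data it's common to only zero-center (subtracting a mean image or per-channel means) and skip the variance normalization, since pixels already share a common scale.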

00:05:23.666 --> 00:05:29.968
I wanted to provide a little bit of extra intuition
about why you might actually want to do this.

00:05:29.968 --> 00:05:39.532
Imagine a simple setup where we have a binary classification problem where
we want to draw a line to separate these red points from these blue points.

00:05:39.532 --> 00:05:46.948
On the left, you have this idea where if those data points are kind
of not normalized and not centered and far away from the origin,

00:05:46.948 --> 00:05:55.007
then we can still use a line to separate them, but now if that line wiggles
just a little bit, then our classification is going to get totally destroyed.

00:05:55.007 --> 00:06:05.992
That kind of means that in the example on the left, the loss function is now extremely
sensitive to small perturbations in that linear classifier in our weight matrix.

00:06:07.315 --> 00:06:14.554
We can still represent the same functions, but that might make
learning quite difficult because, again, their loss is very sensitive

00:06:14.554 --> 00:06:25.351
to our parameter vector, whereas in the situation on the right, if you take that data cloud
and you move it into the origin and you make it unit variance, then now, again, we can still

00:06:25.351 --> 00:06:35.523
classify that data quite well, but now as we wiggle that line a little bit, then our
loss function is less sensitive to small perturbations in the parameter values.

00:06:35.523 --> 00:06:41.064
That maybe makes optimization a little bit
easier, as we'll see a little bit going forward.

00:06:41.064 --> 00:06:46.539
By the way, this situation is not only
in the linear classification case.

00:06:46.539 --> 00:06:57.756
Inside a neural network, remember we kind of have these interleavings of these linear
matrix multiplies, or convolutions, followed by non-linear activation functions.

00:06:59.078 --> 00:07:05.687
If the input to some layer in your neural network is not
centered or not zero mean, not unit variance, then again,

00:07:05.687 --> 00:07:15.632
small perturbations in the weight matrix of that layer of the network could cause large
perturbations in the output of that layer, which, again, might make learning difficult.

00:07:15.632 --> 00:07:20.481
This is kind of a little bit of extra intuition
about why normalization might be important.

00:07:21.864 --> 00:07:26.862
Because we have this intuition that normalization is
so important, we talked about batch normalization,

00:07:26.862 --> 00:07:36.030
which is where we just add this additional layer inside our networks to just
force all of the intermediate activations to be zero mean and unit variance.

00:07:36.030 --> 00:07:41.465
I've sort of resummarized the batch normalization equations
here with the shapes a little bit more explicitly.

00:07:41.465 --> 00:07:45.172
Hopefully this can help you out when you're
implementing this thing on assignment two.

00:07:45.172 --> 00:07:59.254
But again, in batch normalization, we have this idea that in the forward pass, we use the statistics of the mini batch
to compute a mean and a standard deviation, and then use those estimates to normalize our data on the forward pass.

00:07:59.254 --> 00:08:05.641
Then we also reintroduce the scale and shift
parameters to increase the expressivity of the layer.

00:08:05.641 --> 00:08:09.990
You might want to refer back to this
when working on assignment two.
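
To make those shapes concrete, here is a minimal sketch of the batch normalization forward pass in NumPy at training time; the function name and toy inputs are mine, not the assignment's starter code, and running averages for test time are omitted:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift.
    mu = x.mean(axis=0)                    # per-feature mini-batch mean, shape (D,)
    var = x.var(axis=0)                    # per-feature mini-batch variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift restore expressivity

x = np.random.randn(64, 10) * 3.0 + 2.0
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
```

With gamma of ones and beta of zeros, the output is just the normalized activations; learning nontrivial gamma and beta lets the layer undo the normalization if that's what the network prefers.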

00:08:09.990 --> 00:08:18.146
We also talked last time a little bit about babysitting the learning process,
how you should probably be looking at your loss curves during training.

00:08:18.146 --> 00:08:26.683
Here's an example of some networks I was actually training over the
weekend. This is usually my setup when I'm working on these things.

00:08:26.683 --> 00:08:35.795
On the left, I have some plot showing the training loss over time. You can see it's
kind of going down, which means my network is reducing the loss. It's doing well.

00:08:35.795 --> 00:08:48.464
On the right, there's this plot where the X axis is, again, time, or the iteration number,
and the Y axis is my performance measure both on my training set and on my validation set.

00:08:48.465 --> 00:08:58.680
You can see that as we go over time, then my training set performance goes up and up and up and up and
up as my loss function goes down, but at some point, my validation set performance kind of plateaus.

00:08:58.680 --> 00:09:05.066
This kind of suggests that maybe I'm overfitting in this situation.
Maybe I should have been trying to add additional regularization.

00:09:06.317 --> 00:09:09.504
We also talked a bit last time about
hyperparameter search.

00:09:09.504 --> 00:09:14.798
All these networks have sort of a large zoo of
hyperparameters. It's pretty important to set them correctly.

00:09:14.798 --> 00:09:20.725
We talked a little bit about grid search versus random search,
and how random search is maybe a little bit nicer in theory

00:09:20.725 --> 00:09:30.669
because in situations where your performance is more sensitive to one
hyperparameter than another, random search lets you cover that space a little bit better.

00:09:30.669 --> 00:09:37.005
We also talked about the idea of coarse to fine search, where
when you're doing this hyperparameter optimization, probably you

00:09:37.005 --> 00:09:43.408
want to start with very wide ranges for your hyperparameters,
only train for a couple iterations, and then based on

00:09:43.408 --> 00:09:47.973
those results, you kind of narrow in on the
range of hyperparameters that are good.

00:09:47.973 --> 00:09:51.666
Now, again, redo your search in a
smaller range for more iterations.

00:09:51.666 --> 00:09:56.708
You can kind of iterate this process to kind of
hone in on the right region for hyperparameters.

00:09:56.708 --> 00:10:04.455
But again, it's really important to, at the start, have a very coarse range to
start with, where you want very, very wide ranges for all your hyperparameters.

00:10:04.455 --> 00:10:13.746
Ideally, those ranges should be so wide that your network is kind of blowing up at either end
of the range so that you know that you've searched a wide enough range for those things.
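
A rough illustration of that coarse-to-fine random search; the ranges here are hypothetical, and sampling learning rates log-uniformly (uniform in the exponent) is the usual trick:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample learning rate and regularization strength log-uniformly:
# uniform over the exponent, then raise 10 to that power.
def sample_hyperparams(lr_range=(-6, 1), reg_range=(-5, 2)):
    lr = 10 ** rng.uniform(*lr_range)
    reg = 10 ** rng.uniform(*reg_range)
    return lr, reg

# Coarse stage: very wide ranges, train each setting only briefly.
coarse = [sample_hyperparams() for _ in range(20)]

# Fine stage: narrow in around whatever region looked good,
# then train for more iterations.
fine = [sample_hyperparams(lr_range=(-4, -2), reg_range=(-3, -1))
        for _ in range(20)]
```

The coarse ranges are intentionally wide enough that training should blow up at the extremes, which is how you know the search covered enough ground.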

00:10:17.462 --> 00:10:18.295
Question?

00:10:20.044 --> 00:10:26.672
- [Student] How many [speaks too low to hear]
optimize at once? [speaks too low to hear]

00:10:31.840 --> 00:10:34.554
- The question is how many hyperparameters
do we typically search at a time?

00:10:34.554 --> 00:10:38.244
Here is two, but there's a lot more
than two in these typical things.

00:10:38.244 --> 00:10:45.442
It kind of depends on the exact model and the exact architecture, but because
the number of possibilities is exponential in the number of hyperparameters,

00:10:45.442 --> 00:10:48.012
you can't really test too many at a time.

00:10:48.012 --> 00:10:51.737
It also kind of depends on how
many machines you have available.

00:10:51.737 --> 00:10:55.745
It kind of varies from person to person
and from experiment to experiment.

00:10:55.745 --> 00:11:05.353
But generally, I try not to do this over more than maybe two or three or four at
a time at most because, again, this exponential search just gets out of control.

00:11:05.353 --> 00:11:10.406
Typically, learning rate is the really
important one that you need to nail first.

00:11:10.406 --> 00:11:19.542
Then other things, like regularization, like learning rate decay, model size, these
other types of things tend to be a little bit less sensitive than learning rate.

00:11:19.542 --> 00:11:22.723
Sometimes you might do kind of a block
coordinate descent, where you go and find

00:11:22.723 --> 00:11:27.459
the good learning rate, then you go back
and try to look at different model sizes.

00:11:27.459 --> 00:11:30.759
This can help you cut down on the
exponential search a little bit,

00:11:30.759 --> 00:11:35.370
but it's a little bit problem dependent on exactly which
ones you should be searching over in which order.

00:11:36.253 --> 00:11:38.120
More questions?

00:11:38.120 --> 00:11:57.041
- [Student] [speaks too low to hear] Another parameter, but then changing that other parameter, two or three other
parameters, makes it so that your learning rate or the ideal learning rate is still [speaks too low to hear].

00:11:57.041 --> 00:12:04.537
- Question is how often does it happen where when you change one hyperparameter,
then the other, the optimal values of the other hyperparameters change?

00:12:04.537 --> 00:12:11.339
That does happen sometimes, although for learning
rates, that's typically less of a problem.

00:12:11.339 --> 00:12:18.130
For learning rates, typically you want to get in a good range, and then set
it maybe even a little bit lower than optimal, and let it go for a long time.

00:12:18.130 --> 00:12:31.291
Then if you do that, combined with some of the fancier optimization strategies that we'll talk about today,
then a lot of models tend to be a little bit less sensitive to learning rate once you get them in a good range.

00:12:31.291 --> 00:12:32.962
Sorry, did you have a
question in front, as well?

00:12:32.962 --> 00:12:37.308
- [Student] [speaks too low to hear]

00:12:37.308 --> 00:12:41.292
- The question is what's wrong with having a small
learning rate and increasing the number of epochs?

00:12:41.292 --> 00:12:45.139
The answer is that it might take
a very long time. [laughs]

00:12:45.139 --> 00:12:48.383
- [Student] [speaks too low to hear]

00:12:48.383 --> 00:12:54.853
- Intuitively, if you set the learning rate very low and let it go
for a very long time, then this should, in theory, always work.

00:12:54.853 --> 00:13:00.491
But in practice, those factors of 10 or 100 actually
matter a lot when you're training these things.

00:13:00.491 --> 00:13:03.931
Maybe if you got the right learning rate,
you could train it in six hours, 12 hours

00:13:03.931 --> 00:13:11.911
or a day, but then if you just were super safe and dropped it by a factor of 10
or by a factor of 100, now that one-day training becomes 100 days of training.

00:13:11.911 --> 00:13:16.400
That's three months.
That's not going to be good.

00:13:16.400 --> 00:13:20.668
When you're taking these intro computer science classes, they
always kind of sweep the constants under the rug, but when

00:13:20.668 --> 00:13:25.444
you're actually thinking about training things,
those constants end up mattering a lot.

00:13:25.444 --> 00:13:26.861
Another question?

00:13:27.877 --> 00:13:33.385
- [Student] If you have a low learning
rate, [speaks too low to hear].

00:13:33.385 --> 00:13:37.807
- Question is for a low learning rate, are
you more likely to be stuck in local optima?

00:13:37.807 --> 00:13:42.601
I think that makes some intuitive sense, but in
practice, that seems not to be much of a problem.

00:13:42.601 --> 00:13:47.030
I think we'll talk a bit
more about that later today.

00:13:47.030 --> 00:13:53.151
Today I wanted to talk about a couple other really interesting
and important topics when we're training neural networks.

00:13:53.151 --> 00:13:59.655
In particular, I wanted to talk, we've kind of alluded to this fact
of fancier, more powerful optimization algorithms a couple times.

00:13:59.655 --> 00:14:07.067
I wanted to spend some time today and really dig into those and talk about what
are the actual optimization algorithms that most people are using these days.

00:14:07.067 --> 00:14:10.364
We also touched on regularization
in earlier lectures.

00:14:10.364 --> 00:14:15.806
This concept of making your network do additional
things to reduce the gap between train and test error.

00:14:15.806 --> 00:14:22.143
I wanted to talk about some more strategies that people are using
in practice of regularization, with respect to neural networks.

00:14:22.143 --> 00:14:26.401
Finally, I also wanted to talk a bit
about transfer learning, where you can

00:14:26.401 --> 00:14:31.490
sometimes get away with using less data than you
think by transferring from one problem to another.

00:14:32.821 --> 00:14:39.885
If you recall from a few lectures ago, the kind of core
strategy in training neural networks is an optimization problem

00:14:39.885 --> 00:14:50.982
where we write down some loss function, which defines, for each value of the network weights,
the loss function tells us how good or bad is that value of the weights doing on our problem.

00:14:50.982 --> 00:14:56.508
Then we imagine that this loss function gives
us some nice landscape over the weights,

00:14:56.508 --> 00:15:04.142
where on the right, I've shown this maybe small, two-dimensional
problem, where the X and Y axes are two values of the weights.

00:15:04.142 --> 00:15:07.984
Then the color of the plot kind of
represents the value of the loss.

00:15:07.984 --> 00:15:15.195
In this kind of cartoon picture of a two-dimensional problem,
we're only optimizing over these two values, W one, W two.

00:15:15.195 --> 00:15:23.203
The goal is to find the most red region in this case, which
corresponds to the setting of the weights with the lowest loss.

00:15:23.203 --> 00:15:29.099
Remember, we've been working so far with this extremely
simple optimization algorithm, stochastic gradient descent,

00:15:29.099 --> 00:15:32.393
where it's super simple, it's three lines.

00:15:32.393 --> 00:15:39.179
While true, we first evaluate the loss and
the gradient on some mini batch of data.

00:15:39.179 --> 00:15:44.656
Then we step, updating our parameter vector
in the negative direction of the gradient

00:15:44.656 --> 00:15:48.798
because this gives, again, the direction
of greatest decrease of the loss function.

00:15:48.798 --> 00:15:56.282
Then we repeat this over and over again, and hopefully we converge
to the red region and we get great errors and we're very happy.
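
That three-line loop might look like the following sketch, with a toy quadratic loss standing in for a real network; the helper `loss_and_grad` is made up for illustration:

```python
import numpy as np

# Toy loss L(w) = 0.5 * ||w||^2, whose gradient is just w.
# A real version would evaluate the network on a sampled mini-batch.
def loss_and_grad(w, minibatch=None):
    return 0.5 * np.sum(w ** 2), w.copy()

w = np.array([3.0, -2.0])   # initial weights
learning_rate = 0.1

# Vanilla SGD: evaluate loss and gradient, step in the
# negative gradient direction, repeat.
for _ in range(200):
    loss, dw = loss_and_grad(w)
    w -= learning_rate * dw
```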

00:15:56.282 --> 00:16:05.462
But unfortunately, this relatively simple optimization algorithm has
quite a lot of problems that actually could come up in practice.

00:16:05.462 --> 00:16:08.713
One problem with stochastic
gradient descent,

00:16:08.713 --> 00:16:18.969
imagine what happens if our objective function looks something like
this, where, again, we're plotting two values, W one and W two.

00:16:18.969 --> 00:16:23.472
As we change one of those values,
the loss function changes very slowly.

00:16:23.472 --> 00:16:26.687
As we change the horizontal value,
then our loss changes slowly.

00:16:28.152 --> 00:16:34.930
As we go up and down in this landscape, now our loss is
very sensitive to changes in the vertical direction.

00:16:34.930 --> 00:16:40.757
By the way, this is referred to as the loss
having a bad condition number at this point,

00:16:40.757 --> 00:16:46.050
which is the ratio between the largest and smallest
singular values of the Hessian matrix at that point.
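
As a tiny numeric example of that definition (the Hessian here is invented; for a quadratic loss the Hessian is just the constant matrix of second derivatives):

```python
import numpy as np

# Hessian of a "taco shell" quadratic loss L(w) = 0.5*(w0^2 + 20*w1^2):
# flat in one direction, steep in the other.
H = np.diag([1.0, 20.0])

# Condition number: ratio of largest to smallest singular value.
sigma = np.linalg.svd(H, compute_uv=False)
condition_number = sigma.max() / sigma.min()
```

A condition number of 20 already produces visible zigzagging; with hundreds of millions of parameters, the worst-case ratio across directions is typically far larger.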

00:16:46.050 --> 00:16:50.497
But the intuitive idea is that the loss
landscape kind of looks like a taco shell.

00:16:50.497 --> 00:16:54.393
It's sort of very sensitive in one direction,
not sensitive in the other direction.

00:16:54.393 --> 00:17:00.633
The question is what might SGD, stochastic gradient
descent, do on a function that looks like this?

00:17:05.310 --> 00:17:12.196
If you run stochastic gradient descent on this type of function,
you might get this characteristic zigzagging behavior,

00:17:12.197 --> 00:17:22.111
where, because for this type of objective function, the direction of
the gradient does not align with the direction towards the minimum.

00:17:22.112 --> 00:17:29.335
When you compute the gradient and take a step, you might step
sort of over this line and sort of zigzag back and forth.

00:17:29.335 --> 00:17:35.995
In effect, you get very slow progress along the horizontal
dimension, which is the less sensitive dimension, and you get this

00:17:35.995 --> 00:17:41.551
zigzagging, nasty, nasty zigzagging behavior
across the fast-changing dimension.

00:17:41.551 --> 00:17:50.139
This is undesirable behavior. By the way, this problem
actually becomes much more common in high dimensions.

00:17:51.186 --> 00:18:00.617
In this kind of cartoon picture, we're only showing a two-dimensional optimization landscape, but in
practice, our neural networks might have millions, tens of millions, hundreds of millions of parameters.

00:18:00.617 --> 00:18:14.221
That's hundreds of millions of directions along which this thing can move. Now among those hundreds of millions of different
directions to move, if the ratio between the largest one and the smallest one is bad, then SGD will not perform so nicely.

00:18:14.221 --> 00:18:20.573
You can imagine that if we have 100 million parameters, probably
the maximum ratio between those two will be quite large.

00:18:20.573 --> 00:18:26.398
I think this is actually quite a big problem in
practice for many high-dimensional problems.

00:18:27.793 --> 00:18:33.564
Another problem with SGD has to do with
this idea of local minima or saddle points.

00:18:33.564 --> 00:18:44.003
Here I've sort of swapped the graph a little bit. Now the X axis is showing the
value of one parameter, and then the Y axis is showing the value of the loss.

00:18:44.003 --> 00:18:51.583
In this top example, we have kind of this curvy objective
function, where there's a valley in the middle.

00:18:51.583 --> 00:18:55.036
What happens to SGD in this situation?

00:18:55.036 --> 00:18:57.031
- [Student] [speaks too low to hear]

00:18:57.031 --> 00:19:04.454
- In this situation, SGD will get stuck because at this local
minimum, the gradient is zero because it's locally flat.

00:19:04.454 --> 00:19:09.194
Now remember with SGD, we compute the gradient
and step in the direction of opposite gradient,

00:19:09.194 --> 00:19:15.862
so if at our current point, the opposite gradient is zero, then we're
not going to make any progress, and we'll get stuck at this point.

00:19:15.862 --> 00:19:19.406
There's another problem with this idea
of saddle points.

00:19:19.406 --> 00:19:26.140
Rather than being a local minimum, you can imagine a point where
in one direction we go up, and in the other direction we go down.

00:19:26.140 --> 00:19:28.953
Then at our current point,
the gradient is zero.

00:19:28.953 --> 00:19:35.899
Again, in this situation, the function will get stuck
at the saddle point because the gradient is zero.

00:19:35.899 --> 00:19:48.122
Although one thing I'd like to point out is that in one dimension, in a one-dimensional problem like this, local
minima seem like a big problem and saddle points seem like kind of not something to worry about, but in fact,

00:19:48.122 --> 00:19:57.171
it's the opposite once you move to very high-dimensional problems because, again, if you
think about you're in this 100 million dimensional space, what does a saddle point mean?

00:19:57.171 --> 00:20:03.135
That means that at my current point, some directions the
loss goes up, and some directions the loss goes down.

00:20:03.135 --> 00:20:09.591
If you have 100 million dimensions, that's probably
going to happen almost everywhere, basically.

00:20:09.591 --> 00:20:16.744
Whereas a local minima says that of all those 100 million directions
that I can move, every one of them causes the loss to go up.

00:20:16.744 --> 00:20:22.316
In fact, that seems pretty rare when you're thinking
about, again, these very high-dimensional problems.

00:20:23.270 --> 00:20:33.283
Really, the idea that has come to light in the last few years is that when you're training these
very large neural networks, the problem is more about saddle points and less about local minima.

00:20:33.283 --> 00:20:40.140
By the way, this also is a problem not just exactly
at the saddle point, but also near the saddle point.

00:20:40.140 --> 00:20:47.935
If you look at the example on the bottom, you see that in the regions around
the saddle point, the gradient isn't zero, but the slope is very small.

00:20:47.935 --> 00:20:53.611
That means that if we're, again, just stepping in the direction of
the gradient, and that gradient is very small, we're going to make

00:20:53.611 --> 00:21:01.872
very, very slow progress whenever our current parameter
value is near a saddle point in the objective landscape.

00:21:01.872 --> 00:21:10.115
This is actually a big problem.
Another problem with SGD comes from the S.

00:21:10.115 --> 00:21:13.521
Remember that SGD is
stochastic gradient descent.

00:21:13.521 --> 00:21:20.586
Recall that our loss function is typically defined by
computing the loss over many, many different examples.

00:21:20.586 --> 00:21:26.119
In this case, if N is your whole training set,
then that could be something like a million.

00:21:26.119 --> 00:21:29.347
Each time computing the loss
would be very, very expensive.

00:21:29.347 --> 00:21:36.957
In practice, remember that we often estimate the loss and
estimate the gradient using a small mini batch of examples.

00:21:36.957 --> 00:21:42.148
What this means is that we're not actually getting the
true information about the gradient at every time step.

00:21:42.148 --> 00:21:46.773
Instead, we're just getting some noisy
estimate of the gradient at our current point.

00:21:46.773 --> 00:21:50.575
Here on the right, I've kind of faked
this plot a little bit.

00:21:50.575 --> 00:21:59.927
I've just added random uniform noise to the gradient at every
point, and then run SGD with these noisy, messed up gradients.

00:21:59.927 --> 00:22:07.987
This is maybe not exactly what happens with the SGD process, but it still gives
you the sense that if there's noise in your gradient estimates, then vanilla SGD

00:22:07.987 --> 00:22:14.036
kind of meanders around the space and might actually
take a long time to get towards the minima.
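
A sketch of that faked experiment: take a toy quadratic loss and corrupt its true gradient with uniform noise at every step, mimicking noisy mini-batch estimates (all constants here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):
    return w  # true gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, 5.0])
learning_rate = 0.05

# Vanilla SGD with a noisy gradient estimate: the iterate meanders
# around the minimum instead of heading straight to it.
for _ in range(2000):
    dw = grad(w) + rng.uniform(-1.0, 1.0, size=w.shape)
    w -= learning_rate * dw
```

The iterate never settles exactly at the origin; it wanders in a small noise-driven neighborhood around it.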

00:22:15.723 --> 00:22:18.966
Now that we've talked about a lot
of these problems.

00:22:18.966 --> 00:22:20.956
Sorry, was there a question?

00:22:20.956 --> 00:22:25.123
- [Student] [speaks too low to hear]

00:22:29.099 --> 00:22:34.435
- The question is do all of these just go
away if we use normal gradient descent?

00:22:35.281 --> 00:22:44.106
Let's see. I think that the taco shell problem of high condition
numbers is still a problem with full batch gradient descent.

00:22:44.106 --> 00:22:54.120
The noise. As we'll see, we might sometimes introduce additional noise into the network, not
only due to sampling mini batches, but also due to explicit stochasticity in the network,

00:22:54.120 --> 00:22:57.736
so we'll see that later.
That can still be a problem.

00:22:57.736 --> 00:23:05.101
Saddle points, that's still a problem for full batch gradient descent
because there can still be saddle points in the full objective landscape.

00:23:05.101 --> 00:23:10.249
Basically, even if we go to full batch gradient
descent, it doesn't really solve these problems.

00:23:10.249 --> 00:23:16.604
We kind of need to think about a slightly fancier optimization
algorithm that can try to address these concerns.

00:23:16.604 --> 00:23:21.966
Thankfully, there's a really, really simple strategy that
works pretty well at addressing many of these problems.

00:23:21.966 --> 00:23:26.978
That's this idea of adding a momentum term
to our stochastic gradient descent.

00:23:26.978 --> 00:23:32.923
Here on the left, we have our classic old friend, SGD, where
we just always step in the direction of the gradient.

00:23:32.923 --> 00:23:43.062
But now on the right, we have this minor, minor variant called SGD plus momentum,
which is now two equations and five lines of code, so it's twice as complicated.

00:23:43.062 --> 00:23:51.331
But it's very simple. The idea is that we maintain a velocity
over time, and we add our gradient estimates to the velocity.

00:23:51.331 --> 00:23:57.811
Then we step in the direction of the velocity, rather
than stepping in the direction of the gradient.

00:23:57.811 --> 00:24:04.825
This is very, very simple. We also have this
hyperparameter rho now which corresponds to friction.

00:24:05.925 --> 00:24:16.848
Now at every time step, we take our current velocity, we decay the current velocity by
the friction constant, rho, which is often something high, like .9 is a common choice.

00:24:16.848 --> 00:24:21.173
We take our current velocity, we decay it
by friction and we add in our gradient.

00:24:21.173 --> 00:24:26.999
Now we step in the direction of our velocity vector,
rather than the direction of our raw gradient vector.
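
To make the update concrete, here is a minimal NumPy sketch of the SGD plus momentum step just described (the function and variable names are illustrative, not from the lecture slides):

```python
import numpy as np

def sgd_momentum_step(w, dw, v, learning_rate=1e-2, rho=0.9):
    """One SGD+momentum update (illustrative sketch).

    w   -- parameter vector
    dw  -- gradient of the loss at w
    v   -- running velocity, initialized to zeros
    rho -- "friction" hyperparameter, commonly 0.9
    """
    v = rho * v + dw            # decay the old velocity, add in the current gradient
    w = w - learning_rate * v   # step in the direction of the velocity
    return w, v
```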

00:24:28.327 --> 00:24:34.548
This super, super simple strategy actually helps for
all of these problems that we just talked about.

00:24:34.548 --> 00:24:44.809
If you think about what happens at local minima or saddle points, then if we're imagining
velocity in this system, then you kind of have this physical interpretation of this ball

00:24:44.809 --> 00:24:48.215
kind of rolling down the hill,
picking up speed as it comes down.

00:24:48.215 --> 00:24:56.922
Now once we have velocity, then even when we pass that point of local minima,
the point will still have velocity, even if it doesn't have gradient.

00:24:56.922 --> 00:25:01.039
Then we can hopefully get over this
local minima and continue downward.

00:25:01.039 --> 00:25:03.809
There's this similar
intuition near saddle points,

00:25:03.809 --> 00:25:10.734
where even though the gradient around the saddle point is very small,
we have this velocity vector that we've built up as we roll downhill.

00:25:10.734 --> 00:25:16.462
That can hopefully carry us through the saddle point
and let us continue rolling all the way down.

00:25:16.462 --> 00:25:21.949
If you think about what happens in
poor conditioning, now if we were to have

00:25:21.949 --> 00:25:31.105
these kind of zigzagging approximations to the gradient, then those zigzags
will hopefully cancel each other out pretty fast once we're using momentum.

00:25:31.105 --> 00:25:46.006
This will effectively reduce the amount by which we step in the sensitive direction, whereas in the horizontal direction, our velocity will just keep building up, and will actually accelerate our descent
across that less sensitive dimension.

00:25:46.006 --> 00:25:51.344
Adding momentum here can actually help us with
this high condition number problem, as well.

00:25:51.344 --> 00:26:05.207
Finally, on the right, we've repeated the same visualization of gradient descent with noise. Here, the black is this
vanilla SGD, which is sort of zigzagging all over the place, where the blue line is showing now SGD with momentum.

00:26:05.207 --> 00:26:12.644
You can see that because we're building up this velocity
over time, the noise kind of gets averaged out in our gradient estimates.

00:26:12.644 --> 00:26:20.337
Now SGD with momentum ends up taking a much smoother path toward the minimum,
compared with vanilla SGD, which is kind of meandering due to noise.

00:26:20.337 --> 00:26:21.532
Question?

00:26:21.532 --> 00:26:25.699
- [Student] [speaks too low to hear]

00:26:34.776 --> 00:26:40.465
- The question is how does SGD momentum help
with the poorly conditioned coordinate?

00:26:40.465 --> 00:26:49.125
The idea is that if you go back and look at this velocity estimate and look
at the velocity computation, we're adding in the gradient at every time step.

00:26:49.125 --> 00:26:56.603
It kind of depends on your setting of rho, that hyperparameter,
but you can imagine that if the gradient is relatively small,

00:26:56.603 --> 00:26:59.254
and if rho is well
behaved in this situation,

00:26:59.254 --> 00:27:05.512
then our velocity could actually monotonically increase up to a point
where the velocity could now be larger than the actual gradient.

00:27:05.512 --> 00:27:10.377
Then we might actually make faster progress
along the poorly conditioned dimension.

00:27:12.569 --> 00:27:18.020
Kind of one picture that you can have in mind
when we're doing SGD plus momentum is that

00:27:18.020 --> 00:27:20.273
the red here is our current point.

00:27:20.273 --> 00:27:30.075
At our current point, we have some red vector, which is the direction of the gradient, or rather our
estimate of the gradient at the current point. Green is now the direction of our velocity vector.

00:27:30.075 --> 00:27:36.317
Now when we do the momentum update, we're actually
stepping according to a weighted average of these two.

00:27:36.317 --> 00:27:40.049
This helps overcome some noise in our
gradient estimate.

00:27:40.049 --> 00:27:47.724
There's a slight variation of momentum that you sometimes see, called
Nesterov accelerated gradient, also sometimes called Nesterov momentum.

00:27:47.724 --> 00:27:51.737
That switches up this order of things
a little bit.

00:27:51.737 --> 00:28:00.285
In sort of normal SGD momentum, we imagine that we estimate the gradient at
our current point, and then take a mix of our velocity and our gradient.

00:28:00.285 --> 00:28:04.229
With Nesterov accelerated gradient, you do
something a little bit different.

00:28:04.229 --> 00:28:10.765
Here, you start at the red point. You step in the
direction of where the velocity would take you.

00:28:10.765 --> 00:28:18.732
You evaluate the gradient at that point. Then you go back
to your original point and kind of mix together those two.

00:28:18.732 --> 00:28:25.679
This is kind of a funny interpretation, but you can imagine that
you're kind of mixing together information a little bit more.

00:28:25.679 --> 00:28:34.702
If your velocity direction was actually a little bit wrong, it lets you incorporate
gradient information from a little bit larger parts of the objective landscape.

00:28:34.702 --> 00:28:39.351
This also has some really nice theoretical
properties when it comes to convex optimization,

00:28:39.351 --> 00:28:45.946
but those guarantees go a little bit out the window once
it comes to non-convex problems like neural networks.

00:28:45.946 --> 00:28:51.061
Writing it down in equations, Nesterov
momentum looks something like this, where now

00:28:51.061 --> 00:28:57.155
to update our velocity, we take a step, according to our
previous velocity, and evaluate that gradient there.

00:28:57.155 --> 00:29:06.222
Now when we take our next step, we actually step in the direction of our
velocity that's incorporating information from these multiple points.

00:29:06.222 --> 00:29:07.055
Question?

00:29:08.437 --> 00:29:12.357
- [Student] [speaks too low to hear]
- Oh, sorry.

00:29:12.357 --> 00:29:14.743
The question is what's a good
initialization for the velocity?

00:29:14.743 --> 00:29:16.998
This is almost always zero.

00:29:16.998 --> 00:29:20.096
It's not even a hyperparameter.
Just set it to zero and don't worry.

00:29:20.096 --> 00:29:21.315
Another question?

00:29:21.315 --> 00:29:25.482
- [Student] [speaks too low to hear]

00:29:31.992 --> 00:29:38.068
- Intuitively, the velocity is kind of a weighted
sum of your gradients that you've seen over time.

00:29:38.068 --> 00:29:41.466
- [Student] [speaks too low to hear]

00:29:41.466 --> 00:29:44.027
- With more recent gradients
being weighted heavier.

00:29:44.027 --> 00:29:49.716
At every time step, we take our old velocity, we decay
by friction and we add in our current gradient.

00:29:49.716 --> 00:29:54.662
You can kind of think of this as a smooth
moving average of your recent gradients

00:29:54.662 --> 00:30:00.109
with kind of an exponentially decaying weight
on your gradients going back in time.

00:30:02.627 --> 00:30:11.632
This Nesterov formulation is a little bit annoying 'cause if you look at this, normally when you
have your loss function, you want to evaluate your loss and your gradient at the same point.

00:30:11.632 --> 00:30:19.283
Nesterov breaks this a little bit. It's a little bit annoying to work
with. Thankfully, there's a cute change of variables you can do.

00:30:19.283 --> 00:30:29.392
If you do the change of variables and reshuffle a little bit, then you can write Nesterov momentum in a
slightly different way that now, again, lets you evaluate the loss and the gradient at the same point always.

00:30:29.392 --> 00:30:34.093
Once you make this change of variables, you get
kind of a nice interpretation of Nesterov,

00:30:34.093 --> 00:30:41.739
which is that here in the first step, this looks exactly like
updating the velocity in the vanilla SGD momentum case, where we

00:30:41.739 --> 00:30:48.178
have our current velocity, we evaluate gradient at the
current point and mix these two together in a decaying way.

00:30:48.178 --> 00:30:51.951
Now in the second update, now when we're actually
updating our parameter vector, if you look

00:30:51.951 --> 00:30:57.592
at the second equation, we have our current
point plus our current velocity plus

00:30:57.592 --> 00:31:01.454
a weighted difference between our current
velocity and our previous velocity.

00:31:01.454 --> 00:31:11.271
Here, Nesterov momentum is kind of incorporating some kind of error-correcting
term between your current velocity and your previous velocity.
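
In code, the change-of-variables form of Nesterov momentum described here can be sketched like this (a minimal illustration; the names are mine, not from the slides):

```python
import numpy as np

def nesterov_step(x, dx, v, learning_rate=1e-2, rho=0.9):
    """Nesterov momentum after the change of variables (illustrative sketch).

    dx is the gradient evaluated at the current point x, so the loss and
    the gradient are computed at the same place.
    """
    v_prev = v
    v = rho * v - learning_rate * dx      # velocity update, as in vanilla momentum
    x = x + v + rho * (v - v_prev)        # step plus the error-correcting term
    return x, v
```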

00:31:13.029 --> 00:31:25.249
If we look at SGD, SGD momentum and Nesterov momentum on this kind of simple problem, we notice
that SGD, shown in black, kind of takes this slow progress toward the minimum.

00:31:26.346 --> 00:31:29.598
The blue and the green show momentum
and Nesterov.

00:31:29.598 --> 00:31:36.803
These have this behavior of kind of overshooting the minimum 'cause
they're building up velocity going past the minimum, and then kind of

00:31:36.803 --> 00:31:39.849
correcting themselves and coming back
towards the minima.

00:31:39.849 --> 00:31:40.682
Question?

00:31:42.023 --> 00:31:46.190
- [Student] [speaks too low to hear]

00:31:52.024 --> 00:31:58.050
- The question is this picture looks good, but what happens
if your minimum actually lies in a very narrow basin?

00:31:58.050 --> 00:32:01.527
Will the velocity just cause you
to skip right over that minimum?

00:32:01.527 --> 00:32:05.232
That's actually a really interesting point, and
the subject of some recent theoretical work,

00:32:05.232 --> 00:32:09.071
but the idea is that maybe those really
sharp minima are actually bad minima.

00:32:09.071 --> 00:32:17.601
We don't want to even land in those 'cause the idea is that maybe if you have
a very sharp minimum, that could actually be a minimum that overfits more.

00:32:17.601 --> 00:32:22.026
If you imagine that we doubled our training set,
the whole optimization landscape would change,

00:32:22.026 --> 00:32:27.420
and maybe that very sensitive minima would actually
disappear if we were to collect more training data.

00:32:27.420 --> 00:32:31.189
We kind of have this intuition that we
maybe want to land in very flat minima

00:32:31.189 --> 00:32:35.933
because those very flat minima are probably
more robust as we change the training data.

00:32:35.933 --> 00:32:40.453
Those flat minima might actually
generalize better to testing data.

00:32:40.453 --> 00:32:46.284
This is again, sort of very recent theoretical work, but
that's actually a really good point that you bring up.

00:32:46.284 --> 00:32:54.354
In some sense, it's actually a feature and not a bug that
SGD momentum actually skips over those very sharp minima.

00:32:55.979 --> 00:32:59.979
That's actually a good
thing, believe it or not.

00:33:00.825 --> 00:33:04.316
Another thing you can see is if you look at the
difference between momentum and Nesterov here,

00:33:04.316 --> 00:33:12.715
you can see that because of the correction factor in Nesterov, maybe it's
not overshooting quite as drastically, compared to vanilla momentum.

00:33:14.683 --> 00:33:20.068
There's another common optimization
strategy, this algorithm called AdaGrad,

00:33:20.068 --> 00:33:25.292
which John Duchi, who's now a professor
here, worked on during his Ph.D.

00:33:25.292 --> 00:33:37.663
The idea with AdaGrad is that as you, during the course of the optimization, you're going to keep
a running estimate or a running sum of all the squared gradients that you see during training.

00:33:39.569 --> 00:33:43.957
Now rather than having a velocity term,
instead we have this grad squared term.

00:33:43.957 --> 00:33:49.199
During training, we're going to just keep adding
the squared gradients to this grad squared term.

00:33:49.199 --> 00:33:57.449
Now when we update our parameter vector, we'll divide by
this grad squared term when we're making our update step.
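
A minimal sketch of that AdaGrad update in NumPy (names and default values are illustrative):

```python
import numpy as np

def adagrad_step(w, dw, grad_squared, learning_rate=1e-2, eps=1e-7):
    """One AdaGrad update (illustrative sketch).

    grad_squared accumulates the sum of squared gradients per parameter
    and never shrinks, which is why the effective step keeps getting smaller.
    """
    grad_squared = grad_squared + dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```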

00:33:59.334 --> 00:34:07.261
The question is what does this kind of scaling do in this
situation where we have a very high condition number?

00:34:08.393 --> 00:34:12.560
- [Student] [speaks too low to hear]

00:34:16.256 --> 00:34:22.904
- The idea is that if we have two coordinates, one that always has a
very high gradient and one that always has a very small gradient,

00:34:22.904 --> 00:34:35.181
then as we add the sum of the squares of the small gradient, we're going to be dividing by a
small number, so we'll accelerate movement along the slow dimension, along the one dimension.

00:34:35.181 --> 00:34:45.924
Then along the other dimension, where the gradients tend to be very large, then we'll be dividing
by a large number, so we'll kind of slow down our progress along the wiggling dimension.

00:34:45.924 --> 00:34:56.093
But there's kind of a problem here. That's the question of what happens with
AdaGrad over the course of training, as t gets larger and larger and larger?

00:34:56.094 --> 00:34:58.391
- [Student] [speaks too low to hear]

00:34:58.391 --> 00:35:02.239
- With AdaGrad, the steps actually get smaller
and smaller and smaller because we just

00:35:02.239 --> 00:35:09.895
continue updating this estimate of the squared gradients over time, so this
estimate just grows and grows and grows monotonically over the course of training.

00:35:09.895 --> 00:35:15.359
Now this causes our step size to get
smaller and smaller and smaller over time.

00:35:15.359 --> 00:35:20.334
Again, in the convex case, there's some really
nice theory showing that this is actually

00:35:20.334 --> 00:35:28.125
really good 'cause in the convex case, as you approach a
minimum, you kind of want to slow down so you actually converge.

00:35:28.125 --> 00:35:31.192
That's actually kind of a feature
in the convex case.

00:35:31.192 --> 00:35:42.007
But in the non-convex case, that's a little bit problematic because as you come towards a saddle
point, you might get stuck with AdaGrad, and then you kind of no longer make any progress.

00:35:42.007 --> 00:35:48.678
There's a slight variation of AdaGrad, called RMSProp,
that actually addresses this concern a little bit.

00:35:48.678 --> 00:35:53.390
Now with RMSProp, we still keep this estimate
of the squared gradients, but instead

00:35:53.390 --> 00:36:01.085
of just letting that squared estimate continually accumulate over
training, instead, we let that squared estimate actually decay.

00:36:01.085 --> 00:36:09.340
This ends up looking kind of like a momentum update, except we're having kind of
momentum over the squared gradients, rather than momentum over the actual gradients.

00:36:09.340 --> 00:36:20.361
Now with RMSProp, after we compute our gradient, we take our current estimate of the grad
squared, we multiply it by this decay rate, which is commonly something like .9 or .99.

00:36:20.361 --> 00:36:26.601
Then we add in this one minus the decay
rate of our current squared gradient.

00:36:26.601 --> 00:36:37.193
Now over time, you can imagine these leaky estimates decaying away. Then, when we make our step, the
step looks exactly the same as AdaGrad, where we divide by the squared gradient

00:36:37.193 --> 00:36:44.070
in the step to again have this nice property of accelerating movement along
the one dimension, and slowing down movement along the other dimension.

00:36:44.070 --> 00:36:52.411
But now with RMSProp, because these estimates are leaky, then it kind of
addresses the problem of maybe always slowing down where you might not want to.
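
Side by side with AdaGrad, the RMSProp step only changes the accumulation line. A minimal sketch (again, names and defaults are illustrative):

```python
import numpy as np

def rmsprop_step(w, dw, grad_squared, learning_rate=1e-3, decay_rate=0.9, eps=1e-7):
    """One RMSProp update (illustrative sketch).

    Unlike AdaGrad, the squared-gradient estimate is leaky: it decays by
    decay_rate each step instead of accumulating forever.
    """
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```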

00:36:56.455 --> 00:37:04.173
Here again, we're kind of showing our favorite toy problem
with SGD in black, SGD momentum in blue and RMSProp in red.

00:37:04.173 --> 00:37:12.263
You can see that RMSProp and SGD momentum are both doing much better
than SGD, but their qualitative behavior is a little bit different.

00:37:12.263 --> 00:37:17.488
With SGD momentum, it kind of overshoots
the minimum and comes back, whereas with

00:37:17.488 --> 00:37:26.392
RMSProp, it's kind of adjusting its trajectory such that we're
making approximately equal progress among all the dimensions.

00:37:26.392 --> 00:37:34.412
By the way, you can't actually tell, but this plot is also showing
AdaGrad in green with the same learning rate, but it just

00:37:34.412 --> 00:37:38.606
gets stuck due to this problem of
continually decaying learning rates.

00:37:38.606 --> 00:37:45.553
In practice, AdaGrad is maybe not so common for many of these
things. That's a little bit of an unfair comparison of AdaGrad.

00:37:46.392 --> 00:37:52.558
Probably you need to increase the learning rate with AdaGrad, and
then it would end up looking kind of like RMSProp in this case.

00:37:52.558 --> 00:37:57.148
But in general, we tend not to use AdaGrad
so much when training neural networks.

00:37:57.148 --> 00:37:57.981
Question?

00:37:57.981 --> 00:37:59.796
- [Student] [speaks too low to hear]

00:37:59.796 --> 00:38:04.387
- The answer is yes, this problem
is convex, but in this case,

00:38:07.146 --> 00:38:12.003
it's a little bit of an unfair comparison because the
learning rates are not so comparable among the methods.

00:38:12.003 --> 00:38:17.290
I've been a little bit unfair to AdaGrad in this visualization by
showing the same learning rate between the different algorithms,

00:38:17.290 --> 00:38:23.132
when probably you should have separately
tuned the learning rates per algorithm.

00:38:27.970 --> 00:38:35.403
We saw in momentum, we had this idea of velocity, where we're building up velocity
by adding in the gradients, and then stepping in the direction of the velocity.

00:38:35.403 --> 00:38:43.744
We saw with AdaGrad and RMSProp that we had this other idea, of building up an
estimate of the squared gradients, and then dividing by the squared gradients.

00:38:43.744 --> 00:38:46.658
Then these both seem like good ideas
on their own.

00:38:46.658 --> 00:38:50.898
Why don't we just stick 'em together and use
them both? Maybe that would be even better.

00:38:50.898 --> 00:38:56.741
That brings us to this algorithm called Adam,
or rather brings us very close to Adam.

00:38:56.741 --> 00:39:01.119
We'll see in a couple slides that there's
a slight correction we need to make here.

00:39:01.119 --> 00:39:06.888
Here with Adam, we maintain an estimate
of the first moment and the second moment.

00:39:06.888 --> 00:39:14.667
Now in the red, we make this estimate of the
first moment as a weighted sum of our gradients.

00:39:14.667 --> 00:39:22.741
We have this moving estimate of the second moment, like AdaGrad and
like RMSProp, which is a moving estimate of our squared gradients.

00:39:22.741 --> 00:39:28.621
Now when we make our update step, we step using
both the first moment, which is kind of our

00:39:28.621 --> 00:39:37.281
velocity, and also divide by the second moment, or rather the square
root of the second moment, which is this squared gradient term.

00:39:38.128 --> 00:39:46.269
This idea of Adam ends up looking a little bit like RMSProp plus momentum,
or like momentum plus this estimate of the squared gradients.

00:39:46.269 --> 00:39:51.989
It kind of incorporates the nice properties of
both. But there's a little bit of a problem here.

00:39:51.989 --> 00:40:06.134
That's the question of what happens at the very first time step? At the very first time
step, you can see that at the beginning, we've initialized our second moment with zero.

00:40:06.134 --> 00:40:16.803
Now after one update of the second moment, typically this beta two, second
moment decay rate, is something like .9 or .99, something very close to one.

00:40:18.235 --> 00:40:22.867
After one update, our second moment
is still very, very close to zero.

00:40:22.867 --> 00:40:32.377
Now when we're making our update step here and we divide by our second moment, now we're
dividing by a very small number. We're making a very, very large step at the beginning.

00:40:32.377 --> 00:40:37.768
This very, very large step at the beginning is
not really due to the geometry of the problem.

00:40:37.768 --> 00:40:43.422
It's kind of an artifact of the fact that we
initialized our second moment estimate was zero.

00:40:43.422 --> 00:40:44.322
Question?

00:40:44.322 --> 00:40:48.489
- [Student] [speaks too low to hear]

00:40:52.832 --> 00:41:00.365
- That's true. The comment is that if your first moment is also very
small, then you're multiplying by small and you're dividing by square root

00:41:00.365 --> 00:41:02.906
of small squared, so
what's going to happen?

00:41:02.906 --> 00:41:05.746
They might cancel each other
out, you might be okay.

00:41:05.746 --> 00:41:13.632
That's true. Sometimes these cancel each other out and you're okay, but
sometimes this ends up in taking very large steps right at the beginning.

00:41:13.632 --> 00:41:16.245
That can be quite bad.

00:41:16.245 --> 00:41:19.533
Maybe you initialize a little bit poorly.
You take a very large step.

00:41:19.533 --> 00:41:26.145
Now your initialization is completely messed up, and then you're in a very
bad part of the objective landscape and you just can't converge from there.

00:41:26.145 --> 00:41:27.165
Question?

00:41:27.165 --> 00:41:30.915
- [Student] [speaks too low to hear]

00:41:30.915 --> 00:41:37.847
- The question is what is this 10 to the minus seven term in the last
equation? That actually appeared in AdaGrad, RMSProp and Adam.

00:41:37.847 --> 00:41:42.187
The idea is that we're dividing by something. We
want to make sure we're not dividing by zero,

00:41:42.187 --> 00:41:48.609
so we always add a small positive constant to the
denominator, just to make sure we're not dividing by zero.

00:41:48.609 --> 00:41:56.012
That's technically a hyperparameter, but it tends not to matter too much, so just
setting 10 to minus seven, 10 to minus eight, something like that, tends to work well.

00:41:57.967 --> 00:42:05.511
With Adam, remember we just talked about this idea that at the first couple of steps,
we might take very large steps and mess ourselves up.

00:42:05.511 --> 00:42:12.510
Adam also adds this bias correction term to avoid this
problem of taking very large steps at the beginning.

00:42:12.510 --> 00:42:22.619
You can see that after we update our first and second moments, we create an unbiased
estimate of those first and second moments by incorporating the current time step, t.

00:42:22.619 --> 00:42:29.550
Now we actually make our step using these unbiased estimates,
rather than the original first and second moment estimates.

00:42:29.550 --> 00:42:33.167
This gives us our full form of Adam.
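
Putting the pieces together, the full Adam update with bias correction can be sketched like this (an illustrative sketch; names are mine, and the defaults follow the values mentioned in the lecture):

```python
import numpy as np

def adam_step(w, dw, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-7):
    """Full Adam update with bias correction (illustrative sketch; t starts at 1)."""
    m = beta1 * m + (1 - beta1) * dw          # first moment, momentum-like
    v = beta2 * v + (1 - beta2) * dw * dw     # second moment, RMSProp-like
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Note that at t equals one, the bias correction exactly undoes the zero initialization, so the very first step has a sensible scale instead of being huge.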

00:42:33.167 --> 00:42:45.550
By the way, Adam is a really, [laughs] really good optimization algorithm, and it works really well for a lot of different
problems, so that's kind of my default optimization algorithm for just about any new problem that I'm tackling.

00:42:45.550 --> 00:42:53.088
In particular, if you set beta one equals .9, beta two equals
.999, learning rate one e minus three or five e minus four,

00:42:53.088 --> 00:42:58.797
that's a great starting point for just about
all the architectures I've ever worked with.

00:42:58.797 --> 00:43:03.518
Try that. That's a really good
place to start, in general.

00:43:03.518 --> 00:43:05.949
[laughs]

00:43:05.949 --> 00:43:11.634
If we actually plot these things out and look at SGD,
SGD momentum, RMSProp and Adam on the same problem,

00:43:11.634 --> 00:43:18.094
you can see that Adam, in the purple here, kind of
combines elements of both SGD momentum and RMSProp.

00:43:18.094 --> 00:43:25.175
Adam kind of overshoots the minimum a little bit like SGD
momentum, but it doesn't overshoot quite as much as momentum.

00:43:25.175 --> 00:43:33.268
Adam also has this similar behavior of RMSProp of kind of
trying to curve to make equal progress along all dimensions.

00:43:33.268 --> 00:43:37.706
Maybe in this small two-dimensional example,
Adam converged similarly to the other ones,

00:43:37.706 --> 00:43:42.832
but you can see qualitatively that it's kind of
combining the behaviors of both momentum and RMSProp.

00:43:45.042 --> 00:43:48.709
Any questions about
optimization algorithms?

00:43:50.048 --> 00:43:56.606
- [Student] [speaks too low to hear] They still take
a very long time to train. [speaks too low to hear]

00:43:56.606 --> 00:44:03.193
- The question is what does Adam not fix? These neural
networks are still large, and they still take a long time to train.

00:44:04.744 --> 00:44:07.098
There can still be a problem.

00:44:07.098 --> 00:44:11.979
In this picture where we have this landscape
of things looking like ovals, imagine

00:44:11.979 --> 00:44:19.219
that we're kind of making estimates along each dimension
independently, which allows us to speed up or slow down along different

00:44:19.219 --> 00:44:26.576
coordinate axes. One problem is that if that taco shell
is kind of tilted and is not axis aligned, then we're still

00:44:26.576 --> 00:44:29.887
only making estimates along the
individual axes independently.

00:44:30.935 --> 00:44:38.131
That corresponds to taking your rotated taco shell and squishing it
horizontally and vertically, but you can't actually unrotate it.

00:44:38.131 --> 00:44:48.732
In cases where you have this kind of rotated picture of poor conditioning, then
Adam or any of these other algorithms really can't address that concern.

00:44:51.356 --> 00:44:57.706
Another thing that we've seen in all these optimization
algorithms is learning rate as a hyperparameter.

00:44:57.706 --> 00:45:01.828
We've seen this picture before a couple times,
that as you use different learning rates,

00:45:01.828 --> 00:45:05.097
sometimes if it's too high, it
might explode in the yellow.

00:45:05.097 --> 00:45:09.629
If it's a very low learning rate, in the blue,
it might take a very long time to converge.

00:45:09.629 --> 00:45:11.933
It's kind of tricky to pick the right
learning rate.

00:45:13.712 --> 00:45:19.308
This is a little bit of a trick question because we don't actually have
to stick with one learning rate throughout the course of training.

00:45:19.308 --> 00:45:29.705
Sometimes you'll see people decay the learning rates over time, where we can kind of combine
the effects of these different curves on the left, and get the nice properties of each.

00:45:29.705 --> 00:45:39.366
Sometimes you'll start with a higher learning rate near the start of training, and then
decay the learning rate and make it smaller and smaller throughout the course of training.

00:45:39.366 --> 00:45:46.795
A couple strategies for these would be a step decay, where at, say, the 100,000th
iteration, you just decay the learning rate by some factor and you keep going.

00:45:46.795 --> 00:45:52.579
You might see an exponential decay, where
you continually decay during training.

00:45:52.579 --> 00:45:57.598
You might see different variations of continually
decaying the learning rate during training.
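
The two schedules just mentioned can be sketched as small functions (the drop factor, interval, and decay constant here are placeholder values, not recommendations from the lecture):

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=30):
    """Cut the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(lr0, epoch, k=0.05):
    """Continuously decay the learning rate: lr = lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)
```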

00:45:57.598 --> 00:46:04.347
If you look at papers, especially the ResNet paper, you often see plots
that look kind of like this, where the loss is kind of going down,

00:46:04.347 --> 00:46:07.898
then dropping, then flattening again,
then dropping again.

00:46:07.898 --> 00:46:11.312
What's going on in these plots is that
they're using a step decay learning rate,

00:46:11.312 --> 00:46:18.401
where at these parts where it plateaus and then suddenly drops again, those
are the iterations where they dropped the learning rate by some factor.

00:46:18.401 --> 00:46:26.243
This idea of dropping the learning rate, you might imagine that
it got near some good region, but now the gradients got smaller,

00:46:26.243 --> 00:46:28.066
it's kind of bouncing around too much.

00:46:28.066 --> 00:46:32.745
Then if we drop the learning rate, it lets it slow
down and continue to make progress down the landscape.

00:46:32.745 --> 00:46:36.475
This tends to help in practice sometimes.

00:46:36.475 --> 00:46:44.973
Although one thing to point out is that learning rate decay is a little bit more
common with SGD momentum, and a little bit less common with something like Adam.

00:46:44.973 --> 00:46:50.458
Another thing I'd like to point out is that learning
rate decay is kind of a second-order hyperparameter.

00:46:50.458 --> 00:46:53.324
You typically should not optimize
over this thing from the start.

00:46:53.324 --> 00:47:00.877
Usually when you're kind of getting networks to work at the beginning, you
want to pick a good learning rate with no learning rate decay from the start.

00:47:00.877 --> 00:47:06.068
Trying to cross-validate jointly over learning rate decay and
initial learning rate and other things, you'll just get confused.

00:47:06.068 --> 00:47:10.581
What you do for setting learning rate decay
is try with no decay, see what happens.

00:47:10.581 --> 00:47:15.427
Then kind of eyeball the loss curve and
see where you think you might need decay.

00:47:16.860 --> 00:47:24.948
Another thing I wanted to mention briefly is this idea of all these
algorithms that we've talked about are first-order optimization algorithms.

00:47:24.948 --> 00:47:33.064
In this picture, in this one-dimensional picture, we have this
kind of curvy objective function at our current point in red.

00:47:33.064 --> 00:47:36.057
What we're basically doing is computing
the gradient at that point.

00:47:36.057 --> 00:47:40.722
We're using the gradient information to compute
some linear approximation to our function,

00:47:40.722 --> 00:47:44.208
which is kind of a first-order Taylor
approximation to our function.

00:47:44.208 --> 00:47:51.814
Now we pretend that the first-order approximation is our actual
function, and we make a step to try to minimize the approximation.

00:47:51.814 --> 00:47:57.353
But this approximation doesn't hold for very large
regions, so we can't step too far in that direction.

00:47:57.353 --> 00:48:04.509
But really, the idea here is that we're only incorporating information about the
first derivative of the function. You can actually go a little bit fancier.

00:48:04.509 --> 00:48:11.248
There's this idea of second-order approximation, where we take into
account both first derivative and second derivative information.

00:48:11.248 --> 00:48:18.449
Now we make a second-order Taylor approximation to our function
and kind of locally approximate our function with a quadratic.

00:48:18.449 --> 00:48:22.281
Now with a quadratic, you can step right
to the minimum, and you're really happy.

00:48:22.281 --> 00:48:25.769
That's this idea of
second-order optimization.

00:48:25.769 --> 00:48:30.489
When you generalize this to multiple dimensions,
you get something called the Newton step,

00:48:30.489 --> 00:48:35.066
where you compute this Hessian matrix,
which is a matrix of second derivatives,

00:48:35.066 --> 00:48:43.689
and you end up inverting this Hessian matrix in order to step directly
to the minimum of this quadratic approximation to your function.

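To make the Newton step concrete, here's a tiny sketch on a made-up one-dimensional objective (the function and starting point are illustrative, not from the lecture). Each update is x minus the gradient divided by the second derivative, with no learning rate:

```python
# Hypothetical 1-D objective: f(x) = x^4 - 3x^2 + x (nonconvex, for illustration).
def f(x):
    return x**4 - 3*x**2 + x

def grad(x):
    return 4*x**3 - 6*x + 1

def hessian(x):
    # In 1-D the "Hessian" is just the second derivative.
    return 12*x**2 - 6

x = 2.0
for _ in range(10):
    # Newton step: x <- x - H^{-1} g. Note there is no learning rate.
    x = x - grad(x) / hessian(x)
# x has converged to a stationary point where grad(x) is essentially zero.
```

In the multi-dimensional case the division becomes a solve against the full Hessian matrix, which is exactly what becomes impractical at 100 million parameters.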
00:48:43.689 --> 00:48:48.910
Does anyone spot something that's quite different about this
update rule, compared to the other ones that we've seen?

00:48:48.910 --> 00:48:51.107
- [Student] [speaks too low to hear]

00:48:51.107 --> 00:48:54.328
- This doesn't have a learning rate.
That's kind of cool.

00:48:56.463 --> 00:49:00.664
We're making this quadratic approximation and we're
stepping right to the minimum of the quadratic.

00:49:00.664 --> 00:49:04.681
At least in this vanilla version of Newton's
method, you don't actually need a learning rate.

00:49:04.681 --> 00:49:07.849
You just always step to the minimum
at every time step.

00:49:07.849 --> 00:49:13.265
However, in practice, you might end up having a learning rate anyway
because, again, that quadratic approximation might not be perfect,

00:49:13.265 --> 00:49:21.055
so you might only want to step in the direction towards the minimum, rather than actually
stepping to the minimum, but at least in this vanilla version, it doesn't have a learning rate.

00:49:23.994 --> 00:49:27.366
But unfortunately, this is maybe
a little bit impractical for deep learning

00:49:27.366 --> 00:49:34.519
because this Hessian matrix is N by N, where N
is the number of parameters in your network.

00:49:34.519 --> 00:49:38.498
If N is 100 million, then 100
million squared is way too big.

00:49:38.498 --> 00:49:42.046
You definitely can't store that in memory,
and you definitely can't invert it.

00:49:42.046 --> 00:49:46.486
In practice, people sometimes use these
quasi-Newton methods that, rather than working

00:49:46.486 --> 00:49:52.725
with the full Hessian and inverting the full Hessian, they
work with approximations. Low-rank approximations are common.

00:49:52.725 --> 00:49:57.092
You'll sometimes see
these for some problems.

00:49:57.092 --> 00:50:03.487
L-BFGS is one particular second-order optimizer that
keeps this kind of approximation of the Hessian

00:50:03.487 --> 00:50:11.205
that you'll sometimes see, but in practice, it doesn't work too
well for many deep learning problems because these approximations,

00:50:11.205 --> 00:50:16.410
these second-order approximations, don't really
handle the stochastic case very nicely.

00:50:16.410 --> 00:50:20.616
They also tend not to work so well with
non-convex problems.

00:50:20.616 --> 00:50:23.142
I don't want to get into
that right now too much.

00:50:23.142 --> 00:50:29.022
In practice, Adam is probably a really good choice
for many different neural network problems,

00:50:29.022 --> 00:50:38.974
but if you're in a situation where you can afford to do full batch updates, and you know that
your problem doesn't have really any stochasticity, then L-BFGS is kind of a good choice.

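As a sketch of the full-batch, deterministic case where L-BFGS is a good fit, here's a small least-squares problem solved with SciPy's L-BFGS implementation (the problem and all the names here are illustrative; this assumes SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

# A deterministic, full-batch least-squares problem: minimize 0.5 * ||Aw - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

def loss_and_grad(w):
    # With jac=True, minimize expects one callable returning (loss, gradient).
    r = A @ w - b
    return 0.5 * r @ r, A.T @ r

res = minimize(loss_and_grad, np.zeros(10), jac=True, method='L-BFGS-B')
```

Because there is no mini-batch noise here, the quasi-Newton curvature estimates stay consistent from step to step, which is exactly the setting where these methods shine.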
00:50:38.974 --> 00:50:43.181
L-BFGS doesn't really get used for training
neural networks too much, but as we'll see

00:50:43.181 --> 00:50:47.251
in a couple of lectures, it does sometimes
get used for things like style transfer,

00:50:47.251 --> 00:50:54.356
where you actually have less stochasticity and fewer parameters,
but you still want to solve an optimization problem.

00:50:55.834 --> 00:51:00.992
All of these strategies we've talked about
so far are about reducing training error.

00:51:02.344 --> 00:51:07.452
All these optimization algorithms are really about driving down
your training error and minimizing your objective function,

00:51:07.452 --> 00:51:10.403
but we don't really care about
training error that much.

00:51:10.403 --> 00:51:13.203
Instead, we really care about
our performance on unseen data.

00:51:13.203 --> 00:51:16.817
We really care about reducing this gap
between train and test error.

00:51:16.817 --> 00:51:21.228
The question is once we're already
good at optimizing our objective function,

00:51:21.228 --> 00:51:25.535
what can we do to try to reduce this gap and
make our model perform better on unseen data?

00:51:28.497 --> 00:51:33.617
One really quick and dirty, easy thing
to try is this idea of model ensembles

00:51:33.617 --> 00:51:36.767
that sometimes works across many
different areas in machine learning.

00:51:36.767 --> 00:51:44.588
The idea is pretty simple. Rather than having just one model, we'll train
10 different models independently from different initial random restarts.

00:51:44.588 --> 00:51:51.333
Now at test time, we'll run our data through all of the
10 models and average the predictions of those 10 models.

00:51:53.562 --> 00:52:01.555
Averaging these multiple models together tends to reduce overfitting a little bit
and tends to improve performance a little bit, typically by a couple percent.

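Test-time ensembling really is just averaging the models' predicted distributions; here's a minimal sketch with made-up softmax outputs standing in for 10 trained models:

```python
import numpy as np

# Hypothetical softmax outputs from 10 independently trained models,
# on 5 test examples over 3 classes (Dirichlet draws stand in for real predictions).
rng = np.random.default_rng(0)
model_probs = [rng.dirichlet(np.ones(3), size=5) for _ in range(10)]

# Ensemble: average the 10 predicted distributions, then take the argmax.
ensemble_probs = np.mean(model_probs, axis=0)   # shape (5, 3)
ensemble_pred = ensemble_probs.argmax(axis=1)   # final class per example
```

The averaged rows are still valid probability distributions, and the argmax over them is the ensemble's prediction.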
00:52:01.555 --> 00:52:05.302
This is generally not a drastic improvement,
but it is a consistent improvement.

00:52:05.302 --> 00:52:13.263
You'll see that in competitions, like ImageNet and other things like
that, using model ensembles is very common to get maximal performance.

00:52:14.488 --> 00:52:20.482
You can actually get a little bit creative with this. Sometimes rather
than training separate models independently, you can just keep multiple

00:52:20.482 --> 00:52:25.928
snapshots of your model during the course of
training, and then use these as your ensembles.

00:52:25.928 --> 00:52:29.804
Then you still, at test time, need to average
the predictions of these multiple snapshots,

00:52:29.804 --> 00:52:33.244
but you can collect the snapshots during
the course of training.

00:52:34.133 --> 00:52:43.210
There's actually a very nice paper being presented at ICLR this week that kind of
has a fancy version of this idea, where we use a crazy learning rate schedule,

00:52:43.210 --> 00:52:47.996
where our learning rate goes very slow, then
very fast, then very slow, then very fast.

00:52:47.996 --> 00:52:57.631
The idea is that with this crazy learning rate schedule, then over the course of training, the model
might be able to converge to different regions in the objective landscape that all are reasonably good.

00:52:58.717 --> 00:53:05.532
If you do an ensemble over these different snapshots, then you can improve your
performance quite nicely, even though you're only training the model once.

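A schedule of the kind used for these snapshot ensembles can be sketched as a repeating cosine anneal: the rate decays to near zero within each cycle, then restarts high (the specific numbers here are illustrative, not from the paper):

```python
import math

def cyclic_lr(step, lr_max=0.1, cycle_len=1000):
    # Cosine-annealed learning rate that restarts every cycle_len steps:
    # starts at lr_max, decays smoothly toward zero, then jumps back up.
    t = (step % cycle_len) / cycle_len
    return 0.5 * lr_max * (1 + math.cos(math.pi * t))
```

You would save a model snapshot at the end of each cycle, when the rate is near zero and the model has settled into some local region, then average those snapshots' predictions at test time.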
00:53:05.532 --> 00:53:11.198
Questions?
- [Student] [speaks too low to hear]

00:53:25.388 --> 00:53:33.413
- The question is, it's bad when there's a large gap between train and test error 'cause that
means you're overfitting, but if there's no gap, then is that also maybe bad?

00:53:33.413 --> 00:53:37.446
Do we actually want some small,
optimal gap between the two?

00:53:37.446 --> 00:53:39.132
We don't really care about the gap.

00:53:39.132 --> 00:53:44.019
What we really care about is maximizing
the performance on the validation set.

00:53:44.019 --> 00:53:54.995
What tends to happen is that if you don't see a gap, then you could have improved
your absolute performance, in many cases, by overfitting a little bit more.

00:53:54.995 --> 00:54:02.720
There's this weird correlation between the absolute performance on the validation
set and the size of that gap. We only care about absolute performance.

00:54:02.720 --> 00:54:03.735
Question in the back?

00:54:03.735 --> 00:54:07.004
- [Student] Are hyperparameters the same
for the ensemble?

00:54:07.004 --> 00:54:09.528
- Are the hyperparameters the same
for the ensembles?

00:54:09.528 --> 00:54:12.234
That's a good question.
Sometimes they're not.

00:54:12.234 --> 00:54:19.614
You might want to try different sizes of the model, different learning rates,
different regularization strategies and ensemble across these different things.

00:54:19.614 --> 00:54:22.614
That actually does happen sometimes.

00:54:23.496 --> 00:54:31.769
Another little trick you can do sometimes is that during training, you might actually
keep an exponentially decaying average of your parameter vector itself to kind of have

00:54:31.769 --> 00:54:35.778
a smooth ensemble of your own network
during training.

00:54:35.778 --> 00:54:41.649
Then use this smoothly decaying average of your parameter
vector, rather than the actual checkpoints themselves.

00:54:41.649 --> 00:54:45.262
This is called Polyak averaging,
and it sometimes helps a little bit.

00:54:45.262 --> 00:54:50.838
It's just another one of these small tricks you can
sometimes add, but it's not maybe too common in practice.

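A minimal sketch of this Polyak-style parameter averaging, with a dummy update standing in for the optimizer and a typical decay value of 0.999 (that number is my assumption, not from the lecture):

```python
import numpy as np

decay = 0.999
w = np.zeros(4)        # "live" parameters, updated by the optimizer
w_avg = w.copy()       # smoothed copy kept alongside training

for step in range(1000):
    w += 0.01 * np.ones(4)                    # stand-in for an SGD update
    w_avg = decay * w_avg + (1 - decay) * w   # exponential moving average

# At test time you would evaluate the network with w_avg instead of w.
```

The averaged vector lags behind the live parameters, smoothing out the noise of individual updates.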
00:54:50.838 --> 00:54:55.778
Another question you might have is how can we
actually improve the performance of single models?

00:54:57.229 --> 00:55:02.503
When we have ensembles, we still need to run, like,
10 models at test time. That's not so great.

00:55:02.503 --> 00:55:06.219
We really want some strategies to improve
the performance of our single models.

00:55:06.219 --> 00:55:08.237
That's really this idea of regularization,

00:55:08.237 --> 00:55:11.954
where we add something to our model to
prevent it from fitting the training data

00:55:11.954 --> 00:55:16.203
too well in the attempts to make
it perform better on unseen data.

00:55:16.203 --> 00:55:23.515
We've seen a couple ideas, a couple methods for regularization
already, where we add some explicit extra term to the loss.

00:55:23.515 --> 00:55:29.738
We have one term telling the model to fit the
data, and another term that's a regularization term.

00:55:29.738 --> 00:55:33.032
You saw this in homework one,
where we used L2 regularization.

00:55:34.804 --> 00:55:43.001
As we talked about a couple of lectures ago, this L2 regularization
maybe doesn't make a lot of sense in the context of neural networks.

00:55:43.922 --> 00:55:47.982
Sometimes we use other
things for neural networks.

00:55:47.982 --> 00:55:53.376
One regularization strategy that's super, super
common for neural networks is this idea of dropout.

00:55:53.376 --> 00:55:55.080
Dropout is super simple.

00:55:55.080 --> 00:56:02.264
Every time we do a forward pass through the network, at every
layer, we're going to randomly set some neurons to zero.

00:56:02.264 --> 00:56:08.688
Every time we do a forward pass, we'll set a different random subset
of the neurons to zero. This kind of proceeds one layer at a time.

00:56:08.688 --> 00:56:15.193
We run through one layer, we compute the value of the layer, we randomly
set some of them to zero, and then we continue up through the network.

00:56:15.193 --> 00:56:22.445
Now if you look at this fully connected network on the left versus
a dropout version of the same network on the right, you can see

00:56:22.445 --> 00:56:30.400
that after we do dropout, it kind of looks like a smaller version of
the same network, where we're only using some subset of the neurons.

00:56:30.400 --> 00:56:35.746
This subset that we use varies at
each iteration, at each forward pass.

00:56:35.746 --> 00:56:36.732
Question?

00:56:36.732 --> 00:56:40.899
- [Student] [speaks too low to hear]

00:56:43.694 --> 00:56:46.375
- The question is what are we setting
to zero? It's the activations.

00:56:46.375 --> 00:56:51.731
Each layer computes previous activation times
the weight matrix to give you the next activation.

00:56:51.731 --> 00:57:01.592
Then you just take those activations, set some of them to zero, and then your next layer
is the partially zeroed activations times another matrix, giving you your next activations.

00:57:01.592 --> 00:57:03.155
Question?

00:57:03.155 --> 00:57:06.702
- [Student] [speaks too low to hear]

00:57:06.702 --> 00:57:08.751
- Question is which
layers do you do this on?

00:57:08.751 --> 00:57:14.454
It's more common in fully connected layers, but you
sometimes see this in convolutional layers, as well.

00:57:14.454 --> 00:57:23.423
When you're working in convolutional layers, sometimes instead of dropping each
activation randomly, instead you sometimes might drop entire feature maps randomly.

00:57:24.455 --> 00:57:30.117
In convolutions, you have this channel dimension, and you
might drop out entire channels, rather than random elements.

00:57:32.059 --> 00:57:38.480
Dropout is kind of super simple in practice. It only
requires adding two lines, one line per dropout call.

00:57:38.480 --> 00:57:41.572
Here we have a three-layer neural network,
and we've added dropout.

00:57:41.572 --> 00:57:49.460
You can see that all we needed to do was add this extra line where we
randomly set some things to zero. This is super easy to implement.

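A sketch of that forward pass for an illustrative three-layer network, in the numpy style used in the class (the shapes below are made up; the two mask lines are the extra lines being described):

```python
import numpy as np

p = 0.5  # probability of keeping each unit active (a hyperparameter)

def train_step(X, W1, W2, W3):
    # Forward pass of a 3-layer net with dropout after each hidden layer.
    H1 = np.maximum(0, X @ W1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 = H1 * U1                         # drop!
    H2 = np.maximum(0, H1 @ W2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 = H2 * U2                         # drop!
    return H2 @ W3

# Toy shapes, just to show the call.
X = np.random.randn(5, 10)
W1 = np.random.randn(10, 20)
W2 = np.random.randn(20, 20)
W3 = np.random.randn(20, 3)
scores = train_step(X, W1, W2, W3)
```

Each call to train_step samples fresh masks, so every forward pass effectively runs a different subnetwork.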
00:57:49.460 --> 00:57:52.138
But the question is why
is this even a good idea?

00:57:52.138 --> 00:57:58.067
We're seriously messing with the network at training
time by setting a bunch of its values to zero.

00:57:58.067 --> 00:58:00.988
How can this possibly make sense?

00:58:00.988 --> 00:58:08.532
One sort of slightly hand wavy idea that people have is
that dropout helps prevent co-adaptation of features.

00:58:09.622 --> 00:58:15.853
Maybe if you imagine that we're trying to classify cats,
maybe in some universe, the network might learn one neuron

00:58:15.853 --> 00:58:21.066
for having an ear, one neuron for having a
tail, one neuron for the input being furry.

00:58:21.066 --> 00:58:24.751
Then it kind of combines these things
together to decide whether or not it's a cat.

00:58:24.751 --> 00:58:32.831
But now if we have dropout, then in making the final decision about
catness, the network cannot depend too much on any one of these features.

00:58:32.831 --> 00:58:37.725
Instead, it kind of needs to distribute its
idea of catness across many different features.

00:58:37.725 --> 00:58:42.205
This might help prevent
overfitting somehow.

00:58:42.205 --> 00:58:50.347
Another interpretation of dropout that's come out a little bit more recently
is that it's kind of like doing model ensembling within a single model.

00:58:51.690 --> 00:58:58.745
If you look at the picture on the left, after you apply dropout to the network,
we're kind of computing this subnetwork using some subset of the neurons.

00:58:58.745 --> 00:59:03.391
Now every different potential dropout mask
leads to a different potential subnetwork.

00:59:03.391 --> 00:59:09.145
Now dropout is kind of learning a whole ensemble of
networks all at the same time that all share parameters.

00:59:09.145 --> 00:59:13.790
By the way, because the number of potential
dropout masks grows exponentially with the number

00:59:13.790 --> 00:59:17.152
of neurons, you're never going to sample
all of these things.

00:59:18.089 --> 00:59:24.788
This is really a gigantic, gigantic ensemble of
networks that are all being trained simultaneously.

00:59:25.622 --> 00:59:29.128
Then the question is what
happens at test time?

00:59:29.128 --> 00:59:34.158
Once we move to dropout, we've kind of fundamentally
changed the operation of our neural network.

00:59:34.158 --> 00:59:42.850
Previously, we've had our neural network, f, be a function of the
weights, w, and the inputs, x, and then produce the output, y.

00:59:42.850 --> 00:59:48.268
But now, our network is also taking this additional
input, z, which is some random dropout mask.

00:59:48.268 --> 00:59:52.732
That z is random. Having randomness
at test time is maybe bad.

00:59:52.732 --> 00:59:57.444
Imagine that you're working at Facebook, and you want
to classify the images that people are uploading.

00:59:57.444 --> 01:00:03.092
Then today, your image gets classified as a cat, and tomorrow
it doesn't. That would be really weird and really bad.

01:00:03.092 --> 01:00:09.323
You'd probably want to eliminate this stochasticity
at test time once the network is already trained.

01:00:09.323 --> 01:00:12.093
Then we kind of want to average out
this randomness.

01:00:12.093 --> 01:00:18.131
If you write this out, you can imagine actually marginalizing
out this randomness with some integral, but in practice,

01:00:18.131 --> 01:00:24.368
this integral is totally intractable. We don't know
how to evaluate this thing. You're in bad shape.

01:00:24.368 --> 01:00:28.073
One thing you might imagine doing is
approximating this integral via sampling,

01:00:28.073 --> 01:00:31.484
where you draw multiple samples of z
and then average them out at test time,

01:00:31.484 --> 01:00:36.040
but this still would introduce some
randomness, which is a little bit bad.

01:00:36.040 --> 01:00:41.423
Thankfully, in the case of dropout, we can actually
approximate this integral in kind of a cheap way locally.

01:00:41.423 --> 01:00:47.228
If we consider a single neuron, the output is a, the
inputs are x and y, with two weights, w one, w two.

01:00:47.228 --> 01:00:52.622
Then at test time, our value a is just
w one x plus w two y.

01:00:53.590 --> 01:01:00.645
Now imagine that we trained this network. During training,
we used dropout with probability 1/2 of dropping our neurons.

01:01:00.645 --> 01:01:06.317
Now the expected value of a during training, we can
kind of compute analytically for this small case.

01:01:07.712 --> 01:01:12.249
There's four possible dropout masks, and we're going
to average out the values across these four masks.

01:01:12.249 --> 01:01:18.204
We can see that the expected value of a
during training is 1/2 w one x plus w two y.

01:01:19.075 --> 01:01:29.000
There's this disconnect between this average value of w one x plus w two y
at test time, and at training time, the average value is only 1/2 as much.

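The calculation above can be checked by brute force: enumerate all four equally likely masks for the single neuron a = w1*x + w2*y (the particular numbers here are made up for illustration):

```python
# Single neuron a = w1*x + w2*y, dropout with probability 1/2 per input.
w1, w2, x, y = 0.3, -0.7, 2.0, 1.0

# The four possible dropout masks, each occurring with probability 1/4.
masks = [(0, 0), (0, 1), (1, 0), (1, 1)]
values = [m1 * w1 * x + m2 * w2 * y for m1, m2 in masks]

expected_train = sum(values) / 4   # E[a] during training
test_value = w1 * x + w2 * y       # a at test time, with no mask
```

The training-time expectation comes out to exactly half the test-time value, which is why scaling by p = 1/2 at test time lines the two up.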
01:01:29.000 --> 01:01:34.883
One cheap thing we can do is that at test
time, we don't have any stochasticity.

01:01:34.883 --> 01:01:40.736
Instead, we just multiply this output by the dropout
probability. Now these expected values are the same.

01:01:40.736 --> 01:01:44.733
This is kind of like a local cheap
approximation to this complex integral.

01:01:44.733 --> 01:01:48.576
This is what people really commonly do
in practice with dropout.

01:01:49.715 --> 01:01:56.269
With dropout, we have this predict function, and we just
multiply our outputs of the layer by the dropout probability.

01:01:56.269 --> 01:01:59.393
The summary of dropout is that it's
really simple on the forward pass.

01:01:59.393 --> 01:02:03.807
You're just adding two lines to your
implementation to randomly zero out some nodes.

01:02:03.807 --> 01:02:10.209
Then at the test time prediction function, you just
added one little multiplication by your probability.

01:02:10.209 --> 01:02:16.613
Dropout is super simple. It tends to work well
sometimes for regularizing neural networks.

01:02:16.613 --> 01:02:21.454
By the way, one common trick you see
sometimes is this idea of inverted dropout.

01:02:22.665 --> 01:02:28.735
Maybe at test time, you care more about efficiency, so you
want to eliminate that extra multiplication by p at test time.

01:02:28.735 --> 01:02:37.677
Then what you can do is, at test time, use the entire weight matrix with no extra
scaling, and at training time instead divide by p. Training is probably happening on a GPU anyway.

01:02:37.677 --> 01:02:44.733
You don't really care if you do one extra multiply at training time, but then
at test time, you kind of want this thing to be as efficient as possible.

01:02:44.733 --> 01:02:45.566
Question?

01:02:46.416 --> 01:02:56.777
- [Student] [speaks too low to hear]
Now the gradient [speaks too low to hear].

01:02:57.678 --> 01:03:02.212
- The question is what happens to the
gradient during training with dropout?

01:03:02.212 --> 01:03:06.583
You're right. We only end up propagating the
gradients through the nodes that were not dropped.

01:03:06.583 --> 01:03:15.356
This has the consequence that when you're training with dropout, typically training
takes longer because at each step, you're only updating some subparts of the network.

01:03:15.356 --> 01:03:22.287
When you're using dropout, it typically takes longer to train,
but you might have a better generalization after it's converged.

01:03:24.409 --> 01:03:32.810
Dropout, we kind of saw, is one concrete instantiation of a
slightly more general strategy for regularization where during training

01:03:32.810 --> 01:03:37.482
we add some kind of randomness to the network to
prevent it from fitting the training data too well.

01:03:37.482 --> 01:03:41.037
To kind of mess it up and prevent it
from fitting the training data perfectly.

01:03:41.037 --> 01:03:46.160
Now at test time, we want to average out all that
randomness to hopefully improve our generalization.

01:03:46.160 --> 01:03:53.927
Dropout is probably the most common example of this type of strategy,
but actually batch normalization kind of fits this idea, as well.

01:03:53.927 --> 01:04:00.755
Remember in batch normalization, during training, one data point might
appear in different mini batches with different other data points.

01:04:00.755 --> 01:04:07.200
There's a bit of stochasticity with respect to a single data point
with how exactly that point gets normalized during training.

01:04:07.200 --> 01:04:14.735
But now at test time, we kind of average out this stochasticity by using some
global estimates to normalize, rather than the per mini batch estimates.

01:04:14.735 --> 01:04:20.223
Actually batch normalization tends to have kind of a similar
regularizing effect as dropout because they both introduce

01:04:20.223 --> 01:04:25.478
some kind of stochasticity or noise at training
time, but then average it out at test time.

01:04:25.478 --> 01:04:35.744
Actually, when you train networks with batch normalization, sometimes you don't use dropout at
all, and just the batch normalization adds enough of a regularizing effect to your network.

01:04:35.744 --> 01:04:43.833
Dropout is somewhat nice because you can actually tune the regularization strength
by varying that parameter p, and there's no such control in batch normalization.

01:04:43.833 --> 01:04:48.928
Another kind of strategy that fits in this
paradigm is this idea of data augmentation.

01:04:48.928 --> 01:04:57.078
In a vanilla version of training, we have our data and
our label, and we use them to update our CNN at each time step.

01:04:57.078 --> 01:05:03.555
But instead, what we can do is randomly transform the image
in some way during training such that the label is preserved.

01:05:03.555 --> 01:05:09.418
Now we train on these random transformations
of the image rather than the original images.

01:05:09.418 --> 01:05:16.153
Sometimes you might see random horizontal flips 'cause if
you take a cat and flip it horizontally, it's still a cat.

01:05:17.690 --> 01:05:23.763
You'll randomly sample crops of different sizes from the
image because the random crop of the cat is still a cat.

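These two label-preserving transforms can be sketched in a few lines (the 24-pixel crop size and the H x W x 3 layout are illustrative assumptions):

```python
import numpy as np

def augment(img, crop=24):
    # Random label-preserving transform: horizontal flip plus a random crop.
    # img is assumed to be an H x W x 3 array.
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                      # horizontal flip
    H, W, _ = img.shape
    y0 = np.random.randint(0, H - crop + 1)        # random crop offsets
    x0 = np.random.randint(0, W - crop + 1)
    return img[y0:y0 + crop, x0:x0 + crop, :]
```

During training you would call this on every image before the forward pass, so the network rarely sees the exact same input twice.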
01:05:25.188 --> 01:05:30.317
Then during testing, you kind of average out
this stochasticity by evaluating with some

01:05:30.317 --> 01:05:34.309
fixed set of crops, often the four corners
and the middle and their flips.

01:05:34.309 --> 01:05:38.041
What's very common is that when you read, for
example, papers on ImageNet, they'll report

01:05:38.041 --> 01:05:47.308
a single crop performance of their model, which is just like the whole image, and a 10
crop performance of their model, which are these five standard crops plus their flips.

01:05:48.238 --> 01:05:56.345
Also with data augmentation, you'll sometimes use color jittering, where you
might randomly vary the contrast or brightness of your image during training.

01:05:56.345 --> 01:06:04.642
You can get a little bit more complex with color jittering, as well, where you try to
make color jitters that are maybe in the PCA directions of your data space or whatever,

01:06:04.642 --> 01:06:11.456
where you do some color jittering in some data-dependent
way, but that's a little bit less common.

01:06:12.492 --> 01:06:18.037
In general, data augmentation is this really general
thing that you can apply to just about any problem.

01:06:18.037 --> 01:06:24.940
Whatever problem you're trying to solve, you kind of think about what
are the ways that I can transform my data without changing the label?

01:06:24.940 --> 01:06:31.218
Now during training, you just apply these random transformations
to your input data. This sort of has a regularizing effect

01:06:31.218 --> 01:06:38.954
on the network because you're, again, adding some kind of stochasticity
during training, and then marginalizing it out at test time.

01:06:40.055 --> 01:06:45.232
Now we've seen three examples of this pattern,
dropout, batch normalization, data augmentation,

01:06:45.232 --> 01:06:47.154
but there's many other examples, as well.

01:06:47.154 --> 01:06:53.049
Once you have this pattern in your mind, you'll kind of
recognize this thing as you read other papers sometimes.

01:06:53.049 --> 01:06:56.722
There's another kind of related
idea to dropout called DropConnect.

01:06:56.722 --> 01:07:06.265
With DropConnect, it's the same idea, but rather than zeroing out the activations at every
forward pass, instead we randomly zero out some of the values of the weight matrix instead.

01:07:06.265 --> 01:07:09.652
Again, it kind of has this similar flavor.

01:07:09.652 --> 01:07:16.281
Another kind of cool idea that I like, this one's not so
commonly used, but I just think it's a really cool idea,

01:07:16.281 --> 01:07:19.400
is this idea of fractional max pooling.

01:07:19.400 --> 01:07:29.067
Normally when you do two-by-two max pooling, you have these fixed two-by-two regions
over which you pool in the forward pass, but now with fractional max pooling,

01:07:29.067 --> 01:07:35.851
every time we have our pooling layer, we're going to randomize
exactly the regions over which we pool.

01:07:35.851 --> 01:07:43.070
Here in the example on the right, I've shown three different sets
of random pooling regions that you might see during training.

01:07:43.070 --> 01:07:48.857
Now during test time, you kind of average the
stochasticity out

01:07:48.857 --> 01:07:54.704
by either sticking to some fixed set of pooling regions
or drawing many samples and averaging over them.

01:07:54.704 --> 01:07:59.027
That's kind of a cool idea, even though
it's not so commonly used.

01:07:59.027 --> 01:08:05.890
Another really kind of surprising paper in this paradigm that
actually came out in the last year, so this is new since

01:08:05.890 --> 01:08:09.911
the last time we taught the class,
is this idea of stochastic depth.

01:08:09.911 --> 01:08:15.490
Here we have a network on the left. The
idea is that we have a very deep network.

01:08:15.490 --> 01:08:18.530
We're going to randomly drop layers
from the network during training.

01:08:18.530 --> 01:08:24.113
During training, we're going to eliminate some
layers and only use a subset of them.

01:08:24.114 --> 01:08:26.854
Now during test time, we'll
use the whole network.

01:08:26.854 --> 01:08:30.251
This is kind of crazy.
It's kind of amazing that this works,

01:08:30.251 --> 01:08:35.310
but this tends to have kind of a similar regularizing
effect as dropout and these other strategies.

01:08:35.310 --> 01:08:42.041
But again, this is super, super cutting-edge research. This is
not super commonly used in practice, but it is a cool idea.

01:08:44.694 --> 01:08:52.673
Any last minute questions about regularization?
No? Use it. It's a good idea. Yeah?

01:08:52.673 --> 01:08:57.046
- [Student] [speaks too low to hear]

01:08:57.046 --> 01:09:01.184
- The question is do you usually use
more than one regularization method?

01:09:04.325 --> 01:09:09.751
You should generally be using batch normalization as kind of
a good thing to have in most networks nowadays because it

01:09:09.752 --> 01:09:12.650
helps you converge, especially
for very deep things.

01:09:12.650 --> 01:09:25.204
In many cases, batch normalization alone tends to be enough, but sometimes if batch normalization alone
is not enough, then you can consider adding dropout or other things once you see your network overfitting.

01:09:25.204 --> 01:09:28.526
You generally don't do a blind
cross-validation over these things.

01:09:28.526 --> 01:09:33.942
Instead, you add them in in a targeted way
once you see your network is overfitting.

01:09:36.400 --> 01:09:38.981
One quick thing, it's this
idea of transfer learning.

01:09:38.981 --> 01:09:47.018
We've kind of seen with regularization, we can help reduce the gap between
train and test error by adding these different regularization strategies.

01:09:48.903 --> 01:09:53.012
One problem with overfitting is sometimes you
overfit 'cause you don't have enough data.

01:09:53.012 --> 01:10:00.444
You want to use a big, powerful model, but that big, powerful
model just is going to overfit too much on your small dataset.

01:10:00.444 --> 01:10:05.909
Regularization is one way to combat that, but
another way is through using transfer learning.

01:10:05.909 --> 01:10:12.730
Transfer learning kind of busts this myth that you
need a huge amount of data in order to train a CNN.

01:10:12.730 --> 01:10:15.300
The idea is really simple.

01:10:15.300 --> 01:10:20.798
You'll maybe first take some CNN.
Here is kind of a VGG style architecture.

01:10:20.798 --> 01:10:25.031
You'll take your CNN, you'll train it
in a very large dataset, like ImageNet,

01:10:25.031 --> 01:10:28.039
where you actually have enough data
to train the whole network.

01:10:28.039 --> 01:10:34.596
Now the idea is that you want to apply the features from
this dataset to some small dataset that you care about.

01:10:34.596 --> 01:10:42.864
Maybe instead of classifying the 1,000 ImageNet categories, now you want to
classify 10 dog breeds or something like that. You only have a small dataset.

01:10:42.864 --> 01:10:45.917
Here, our small dataset
only has C classes.

01:10:45.917 --> 01:10:58.135
Then what you'll typically do is, for this last fully connected layer that goes from the last-layer
features to the final class scores, you need to reinitialize that matrix randomly.

01:10:59.651 --> 01:11:02.952
For ImageNet, it was a 4,096-by-1,000
dimensional matrix.

01:11:02.952 --> 01:11:09.182
Now for your new classes, it might
be 4,096-by-C or by 10 or whatever.

01:11:09.182 --> 01:11:13.985
You reinitialize this last matrix randomly,
freeze the weights of all the previous layers

01:11:13.985 --> 01:11:21.947
and now just basically train a linear classifier, and only train the
parameters of this last layer and let it converge on your data.

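The last-layer recipe can be sketched in plain numpy: random features stand in for the frozen 4096-dimensional activations of a pretrained network, and only the new C-way classifier matrix gets gradient updates (all sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 100, 4096, 10
features = rng.normal(size=(N, D))   # stand-in for frozen pretrained features
labels = rng.integers(0, C, size=N)  # made-up labels for the new task

W = 0.01 * rng.normal(size=(D, C))   # reinitialized last layer, the only thing trained
lr = 1e-2
for _ in range(50):
    # Softmax over the linear scores.
    scores = features @ W
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # Gradient of average cross-entropy loss with respect to the scores.
    dscores = probs.copy()
    dscores[np.arange(N), labels] -= 1
    dscores /= N
    # Only the last-layer matrix updates; the features never change.
    W -= lr * (features.T @ dscores)
```

Because the backward pass stops at the frozen features, this is exactly training a linear classifier on top of the pretrained representation.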
01:11:23.788 --> 01:11:28.756
This tends to work pretty well if you only
have a very small dataset to work with.

01:11:28.756 --> 01:11:35.166
Now if you have a little bit more data, another thing
you can try is actually fine tuning the whole network.

01:11:35.166 --> 01:11:44.935
After that top layer converges and after you learn that last layer for your data,
then you can consider actually trying to update the whole network, as well.

01:11:44.935 --> 01:11:49.434
If you have more data, then you might consider
updating larger parts of the network.

01:11:49.434 --> 01:11:56.143
A general strategy here is that when you're updating the network,
you want to drop the learning rate from its initial learning rate

01:11:56.143 --> 01:12:02.973
because the original parameters in this network that
converged on ImageNet probably work pretty well generally,

01:12:02.973 --> 01:12:08.605
and you just want to change them a very small
amount to tune performance for your dataset.

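One way to picture the "drop the learning rate" advice is with per-layer learning rates: the pretrained layers get roughly a tenth of the base rate, while the freshly reinitialized head gets the full rate. Below is a minimal numpy sketch with made-up layer sizes, not a real network:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, C = 32, 64, 128, 5               # made-up sizes for illustration

X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)

W1 = 0.1 * rng.standard_normal((D, H))    # stand-in "pretrained" layer
W2 = 0.01 * rng.standard_normal((H, C))   # reinitialized classifier head

base_lr = 0.1
lr = {"W1": base_lr / 10, "W2": base_lr}  # nudge pretrained weights gently

def loss_and_grads(W1, W2):
    h = np.maximum(0.0, X @ W1)           # ReLU features
    scores = h @ W2
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(N), y]).mean()
    dscores = p
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    dW2 = h.T @ dscores                   # gradient for the new head
    dh = dscores @ W2.T
    dh[h <= 0.0] = 0.0                    # backprop through ReLU
    dW1 = X.T @ dh                        # gradient for the pretrained layer
    return loss, dW1, dW2

init_loss, _, _ = loss_and_grads(W1, W2)
for step in range(200):                   # fine-tune the whole network
    loss, dW1, dW2 = loss_and_grads(W1, W2)
    W1 -= lr["W1"] * dW1                  # small step: stay near ImageNet weights
    W2 -= lr["W2"] * dW2                  # full step: the head is new anyway
final_loss, _, _ = loss_and_grads(W1, W2)
```

In a real framework you'd express the same idea with per-parameter-group learning rates rather than hand-written gradients.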
01:12:08.605 --> 01:12:15.490
Then when you're working with transfer learning, you kind of imagine
this two-by-two grid of scenarios where on the one side, you have

01:12:15.490 --> 01:12:20.113
maybe a very small amount of data for your dataset,
or a very large amount of data for your dataset.

01:12:21.188 --> 01:12:28.780
Then maybe your data is very similar to images. Like, ImageNet
has a lot of pictures of animals and plants and stuff like that.

01:12:28.780 --> 01:12:35.335
If you want to just classify other types of animals and plants and
other types of images like that, then you're in pretty good shape.

01:12:35.335 --> 01:12:48.861
Then generally what you do is if your data is very similar to something like ImageNet, if you have a very small amount
of data, you can just basically train a linear classifier on top of features extracted using an ImageNet model.

01:12:48.861 --> 01:12:54.786
If you have a little bit more data to work with, then
you might imagine fine tuning the network on your data.

01:12:54.786 --> 01:12:58.755
However, you sometimes get in trouble if your
data looks very different from ImageNet.

01:12:58.755 --> 01:13:06.781
If you're working with medical images, X-rays or CAT scans
or something that looks very different from images in ImageNet, in that case,

01:13:06.781 --> 01:13:09.072
you maybe need to get a
little bit more creative.

01:13:09.072 --> 01:13:14.408
Sometimes it still works well here, but those
last layer features might not be so informative.

01:13:14.408 --> 01:13:21.507
You might consider reinitializing larger parts of the network and
getting a little bit more creative and trying more experiments here.

01:13:21.507 --> 01:13:29.015
This is somewhat mitigated if you have a large amount of data in your very different
dataset 'cause then you can actually fine tune larger parts of the network.

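That two-by-two grid is easy to write down as a rule of thumb. The function below is just a hypothetical summary of the heuristics above, not a hard rule:

```python
def transfer_strategy(dataset_size, similar_to_pretraining):
    """Rough rule of thumb for the 2x2 transfer learning grid.

    dataset_size: "small" or "large"
    similar_to_pretraining: is your data close to, e.g., ImageNet?
    """
    if similar_to_pretraining:
        if dataset_size == "small":
            return "train a linear classifier on top of frozen features"
        return "fine-tune a few of the top layers"
    if dataset_size == "small":
        return "tricky: get creative (e.g., features from earlier layers)"
    return "fine-tune a larger part of the network, or all of it"
```

The hardest cell is small-and-different, where the top-layer features may not transfer and you have too little data to retrain much.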
01:13:29.015 --> 01:13:32.587
Another point I'd like to make is this idea
of transfer learning is super pervasive.

01:13:32.587 --> 01:13:35.660
It's actually the norm,
rather than the exception.

01:13:35.660 --> 01:13:40.562
As you read computer vision papers, you'll often
see system diagrams like this for different tasks.

01:13:40.562 --> 01:13:44.706
On the left, we're working with object detection.
On the right, we're working with image captioning.

01:13:44.706 --> 01:13:48.387
Both of these models have a CNN
that's kind of processing the image.

01:13:48.387 --> 01:13:53.913
In almost all applications of computer vision these days,
most people are not training these things from scratch.

01:13:53.913 --> 01:13:59.973
Almost always, that CNN will be pretrained on ImageNet,
and then potentially fine tuned for the task at hand.

01:13:59.973 --> 01:14:07.089
Also, in the captioning case, sometimes you can actually
pretrain some word vectors relating to the language, as well.

01:14:07.089 --> 01:14:14.143
You maybe pretrain the CNN on ImageNet, pretrain some word vectors on a
large text corpus, and then fine tune the whole thing for your dataset.

01:14:14.143 --> 01:14:22.278
Although in the case of captioning, I think this pretraining with word
vectors tends to be a little bit less common and a little bit less critical.

01:14:22.278 --> 01:14:33.225
The takeaway for your projects, and more generally as you work on different models, is that
whenever you have some problem that you want to tackle but you don't have a large dataset,

01:14:33.225 --> 01:14:41.673
then what you should generally do is download some pretrained model
that's relatively close to the task you care about, and then either

01:14:41.673 --> 01:14:44.859
reinitialize parts of that model or
fine tune that model for your data.

01:14:44.859 --> 01:14:50.676
That tends to work pretty well, even if you have
only a modest amount of training data to work with.

01:14:50.676 --> 01:14:57.738
Because this is such a common strategy, all of the different deep learning
software packages out there provide a model zoo where you can just download

01:14:57.738 --> 01:15:01.099
pretrained versions of various models.

01:15:01.099 --> 01:15:06.043
In summary today, we talked about optimization,
which is about how to reduce the training loss.

01:15:06.043 --> 01:15:10.884
We talked about regularization, which is
improving your performance on the test data.

01:15:10.884 --> 01:15:12.838
Model ensembling kind of fit into there.

01:15:12.838 --> 01:15:17.440
We also talked about transfer learning, which is
how you can actually do better with less data.

01:15:17.440 --> 01:15:21.940
These are all super useful strategies. You
should use them in your projects and beyond.

01:15:21.940 --> 01:15:25.238
Next time, we'll talk more concretely about some of
the different deep learning software packages out there.